This paper is devoted to studying constrained continuous-time 
Markov decision processes (MDPs) in the class of randomized poli- 
cies depending on state histories. The transition rates may be un- 
bounded, the reward and costs are admitted to be unbounded from 
above and from below, and the state and action spaces are Polish 
spaces. The optimality criterion to be maximized is the expected dis- 
counted rewards, and the constraints can be imposed on the expected 
discounted costs. First, we give conditions for the nonexplosion of 
underlying processes and the finiteness of the expected discounted 
rewards/costs. Second, using a technique of occupation measures, we 
prove that the constrained optimality of continuous-time MDPs can 
be transformed to an equivalent (optimality) problem over a class 
of probability measures. Based on the equivalent problem and a 
so-called w-weak convergence of probability measures developed in 
this paper, we show the existence of a constrained optimal policy. 
Third, by providing a linear programming formulation of the equiva- 
lent problem, we show the solvability of constrained optimal policies. 
Finally, we use two computable examples to illustrate our main re- 
sults. 

1. Introduction. Constrained Markov decision processes (MDPs) form 
an important class of stochastic control problems and have been widely 
studied. Existing works on constrained MDPs can be roughly classified into 
four groups: (i) constrained discrete-time MDPs with denumerable states 
[1, 2, 6-10, 23, 25, 37, 38, 41] and their extensive references, (ii) constrained 
discrete-time MDPs with a Polish state space [19, 20, 29, 33] and their bibliographies, 
(iii) constrained continuous-time MDPs with denumerable states 
[13, 15, 34, 36, 42], and (iv) constrained continuous-time MDPs with a Polish 
state space [11]. A review of these references shows that most of the related 
literature concentrates on the first three groups. To the best of our 
knowledge, the fourth group is addressed only in [11] for the average criteria. 
Concerning group (i), the existence and algorithms of constrained optimal 
policies are given in [6-10] for variant discounted criteria when states and 
actions are finite, in [1, 25, 37] for the discounted criteria and denumerable 
states, and in [1, 2, 23, 37, 38] for the average criteria and denumerable 
states. Also, the existence of constrained optimal policies and linear pro- 
gramming formulation for group (ii) are given in [19, 33] for the discounted 
criteria and in [20, 29, 33] for the average criteria. Although group (iii) has 
been studied in [13, 15, 34, 36, 42], the references [13, 15, 34, 36, 42] deal with 
the case of a single constraint, the transition rates in [34] are assumed to 
be bounded, and the assumption of denumerable states in these references 
cannot be dropped. On the other hand, as mentioned above, constrained 
MDPs in Polish spaces are also studied in [19, 20, 29, 33] for the discrete- 
time case and in [11] for the continuous-time case. However, the reward and 
cost functions in [29] are assumed to be all bounded, and all cost functions 
in [11, 19, 20, 33] are assumed to be essentially nonnegative. Further, this 
nonnegativity assumption cannot be removed because it is required for 
the use of the standard weak convergence of probability measures. This in 
turn implies that the constrained optimality problem of minimizing non- 
negative costs in [11, 19, 20] with constraints imposed on other nonnegative 
costs cannot be transformed to an equivalent optimality problem of max- 
imizing bounded rewards as in [29] with constraints imposed on bounded 
costs. Hence, constrained discrete- and continuous-time MDPs in Polish 
spaces, in which rewards (to be maximized) and costs (with constraints) 
may be unbounded from above and from below, have not been studied. 

On the other hand, as is known, continuous-time MDPs in Polish spaces 
have been studied in [11, 12, 16, 27, 34]. However, the treatments in [12, 
16, 27] are on the unconstrained case, whereas the results in [11] for the 
constrained case cannot be applied to the case in which the criterion to be 
maximized is unbounded rewards. This is because the cost to be minimized 
in [11] is required to be nonnegative. Moreover, the study in [11, 12, 16] with 
unbounded transition rates is limited to the class of Markov policies, and 
yet the case of randomized policies depending on state histories in [27, 34] 
is for bounded transition rates. Hence, as noted in [15, 17, 40], the study on 
unconstrained continuous-time MDPs with unbounded transition rates and 
history-dependent policies is an unsolved problem. 

Constrained continuous-time MDPs with unbounded transition rates and 
policies depending on state histories have not been studied yet, and they will 
be considered in this paper. More precisely, we will deal with constrained 
continuous-time MDPs, which have the following features: (1) the transition 
rates may be unbounded; (2) the reward and costs are allowed to 
be unbounded from above and from below; (3) the state and action spaces 
are Polish spaces; (4) admissible policies can be randomized and depend 
on state histories; and (5) the optimality criterion is to maximize expected 
discounted rewards, and several constraints are imposed on expected dis- 
counted costs. 

First, we give the conditions under which we ensure the nonexplosion of 
underlying processes induced from unbounded transition rates and random- 
ized policies depending on state histories (see Theorem 3.1 below). This re- 
sult is a natural extension of the corresponding regularity of a jump Markov 
process in [5, 12, 15, 16, 31] to a so-called "non-Markov" case and also a 
generalization of the regularity in [18, 26-28, 30, 34, 37, 39, 40] for bounded 
transition rates. Inspired by the condition for the nonexplosion, we obtain a 
condition (see Theorem 3.3 below) for the finiteness of the expected discounted 
rewards/costs of each policy when rewards/costs are unbounded. 

Second, as in [1, 2, 19-21, 29, 33, 35] for constrained MDPs, by intro- 
ducing an occupation measure, we prove that the constrained optimality 
problem in continuous-time MDPs [see (2.12) below] can be transformed 
into an equivalent optimality problem [see (3.3) below] over a class of some 
probability measures. The standard weak convergence technique used in 
[11, 19, 20, 22, 27, 29] for nonnegative costs does not apply directly to 
the case wherein rewards/costs are unbounded from above and from below. 
Therefore, to solve the equivalent optimality problem in which rewards/costs 
may be unbounded from above and from below, we introduce (Definition 3.7 
below) a so-called w-weak convergence of probability measures. This w-weak 
convergence is an extension of the standard weak convergence of probability 
measures. Using the properties of the w-weak convergence and occupation 
measures developed here (see Theorem 3.5 and Lemmas 3.8 and 3.9 below), 
we prove the existence of a constrained optimal policy under mild reasonable 
conditions (see Theorem 3.11 below). These conditions are slightly different 
from the usual continuity-compactness ones in [12-15] for continuous-time 
MDPs and in [1, 2, 19, 20, 22, 29] for the discrete-time MDPs, and thus 
they are weaker than those in the literature [12-15, 37]; see Remarks 3.10 
and 3.12 for details. 

Third, for the solvability of constrained optimal policies, we further trans- 
form the equivalent optimality problem to a linear programming (LP) prob- 
lem [see (3.9) below] by using the properties of occupation measures again. 
Then we present the relationship between a constrained optimal policy and 
an optimal solution to the LP (see Theorem 3.13 below), and characterize 
a stationary policy (see Theorem 3.15 below). This relationship and characterization 
of a stationary policy are used to obtain the solvability and 
structure of a constrained optimal policy (see Corollary 3.14 and Theorem 
3.16 below). 

Finally, to illustrate our main results, we present two computable examples 
in which our conditions are satisfied, whereas some of those in [11, 19, 20, 
22, 27, 29] fail to hold (see Remark 4.7 below). In particular, our approach 
is also suitable for the case of discrete-time MDPs with rewards/costs being 
unbounded from above and from below, and similar results for the discrete- 
time case can also be obtained; see Remark 3.17 for details. However, our 
model cannot be transformed to an equivalent one of discrete-time MDPs 
using the uniformization technique because the transition rates in our model 
may be unbounded. 

The rest of this paper is organized as follows. In Section 2, the model 
and the constrained optimality problem that we are concerned with are 
introduced. The main results of this paper are stated in Section 3, and 
illustrated with computable examples in Section 4. The proofs of the main 
results are presented in Section 5. 

2. The model for constrained continuous-time MDPs. 

Notation. If $X$ is a Polish space (i.e., a complete and separable metric 
space) and $w \geq 1$ is a real-valued measurable function on $X$, we denote by 
$\mathcal{B}(X)$ the Borel $\sigma$-algebra on $X$, by $D^c$ the complement of a set $D \subseteq X$ (with 
respect to $X$), by $\|u\|_w$ the $w$-weighted norm of a real-valued measurable 
function $u$ on $X$ [i.e., $\|u\|_w := \sup_{x \in X} |u(x)|/w(x)$], by $C_b(X)$ the set of all 
bounded continuous functions on $X$, and by $\mathcal{P}(X)$ the set of all probability 
measures on $\mathcal{B}(X)$. Let 

$B_w(X) := \{u \mid \|u\|_w < \infty\}$ 

be the Banach space. 

We now introduce the model of constrained continuous-time MDPs, 

(2.1) $\{S, (A(x) \subseteq A, x \in S), q(\cdot \mid x,a), r(x,a), (c_n(x,a), d_n, 1 \leq n \leq N)\}$, 

where $S$ is a state space, $A$ is an action space, and $A(x)$ is a Borel set of 
admissible actions at state $x \in S$. We suppose that $S$ and $A$ are Polish 
spaces, and the following set: 

(2.2) $K := \{(x,a) \mid x \in S, a \in A(x)\}$ 

is a Borel subset of $S \times A$. 

The function $q(\cdot \mid x,a)$ in (2.1) refers to transition rates, that is, it satisfies 
the following: 

(T1) For each fixed $(x,a) \in K$, $q(\cdot \mid x,a)$ is a signed measure on $\mathcal{B}(S)$, 
whereas for each fixed $D \in \mathcal{B}(S)$, $q(D \mid \cdot)$ is a real-valued Borel-measurable 
function on $K$; 
(T2) $0 \leq q(D \mid x,a) < \infty$ for all $(x,a) \in K$ and $x \notin D \in \mathcal{B}(S)$; and 
(T3) $q(S \mid x,a) = 0$ for all $(x,a) \in K$. [Hence, $q(\{x\} \mid x,a)$ is finite for all 
$(x,a) \in K$.] 

The model is also assumed to be stable, which means 
(2.3) $q^*(x) := \sup_{a \in A(x)} |q(\{x\} \mid x,a)| < \infty \quad \forall x \in S$. 
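
For a finite state and action space, conditions (T2), (T3) and the stability requirement (2.3) reduce to simple checks on a table of rates. The following is a minimal sketch under that finiteness assumption; the dictionary layout `rates[(x, a)][y]` for the rate of jumping from x to y under action a, the function names and the example data are all hypothetical choices made for illustration.

```python
def check_transition_rates(rates, tol=1e-12):
    """Check (T2) and (T3) for a finite model.

    rates[(x, a)] is a dict mapping each state y to q({y} | x, a)
    (a hypothetical layout chosen for this illustration).
    """
    for (x, a), row in rates.items():
        # (T2): rates to states other than x are nonnegative and finite.
        for y, rate in row.items():
            if y != x and not (0.0 <= rate < float("inf")):
                return False
        # (T3): conservativeness, q(S | x, a) = 0.
        if abs(sum(row.values())) > tol:
            return False
    return True


def q_star(rates):
    """Return q*(x) = sup_a |q({x} | x, a)| for every state x, as in (2.3)."""
    result = {}
    for (x, a), row in rates.items():
        result[x] = max(result.get(x, 0.0), abs(row.get(x, 0.0)))
    return result


# Illustrative data: two actions in state 0 and one action in state 1.
example = {
    (0, "slow"): {0: -1.0, 1: 1.0},
    (0, "fast"): {0: -3.0, 1: 3.0},
    (1, "stay"): {0: 2.0, 1: -2.0},
}
```

In a finite model stability is automatic; the check becomes informative only as a sanity test on data, since (2.3) is a genuine restriction when $A(x)$ is infinite.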

Finally, the function $r(x,a)$ on $K$ denotes the reward, whereas the functions 
$c_n(x,a)$ on $K$ and the real numbers $d_n$ denote the costs and constraints, 
respectively. We assume that $r(x,a)$ and $c_n(x,a)$ are real-valued measurable 
on $K$. [$r(x,a)$ is allowed to take positive and negative values, so it can be 
interpreted as a cost rather than a "reward" only.] 

To complete the specification of the constrained optimality problem, we 
of course need an optimality criterion. This requires the definition of a class 
of policies admissible to a controller. To do so, we introduce some notation 
as in [24, 27, 28]. 

Let $S_\infty := S \cup \{x_\infty\}$ with $x_\infty$ being an isolated point, $\Omega^0 := (S \times \mathbb{R}_+)^\infty$ 
with $\mathbb{R}_+ := (0, \infty)$, and $\Omega := \Omega^0 \cup \{(x_0,\theta_1,x_1,\ldots,\theta_{k-1},x_{k-1},\infty,x_\infty,\ldots) \mid \theta_l \in \mathbb{R}_+, x_0, x_l \in S$ 
for each $1 \leq l \leq k-1$ and $k \geq 2\}$. By the corresponding modification 
of the $\sigma$-algebra over $\Omega^0$, we can obtain the basic measurable space 
$(\Omega, \mathcal{F})$. Then we define maps $T_k$, $X_k$, $\Theta_k$ $(k = 0, 1, \ldots)$ and $\xi_t$ $(t \geq 0)$ on $(\Omega, \mathcal{F})$ 
as follows: for each $e := (x_0,\theta_1,x_1,\ldots,\theta_k,x_k,\ldots) \in \Omega$, let 

(2.4) $T_k(e) := \theta_1 + \cdots + \theta_k$ (for $k \geq 1$), $T_\infty(e) := \lim_{k \to \infty} T_k(e)$ with $T_0(e) := 0$; 
      $X_{k-1}(e) := x_{k-1}$, $\Theta_k(e) := \theta_k$ for $k \geq 1$; 

(2.5) $\xi_t(e) := \sum_{k \geq 0} I_{\{T_k \leq t < T_{k+1}\}}(e)\, x_k + x_\infty I_{\{T_\infty \leq t\}}(e)$, 

where $I_D$ stands for the indicator function of a set $D$. Let $h_k(e) = (x_0,\theta_1,x_1,\ldots,\theta_k,x_k)$, 
and call $h_k(e)$ a $k$-component state history. Obviously, these maps are 
measurable on $(\Omega, \mathcal{F})$. In what follows, the argument $e = (x_0,\theta_1,x_1,\ldots,\theta_k,x_k,\ldots)$ 
is often omitted. 
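
The maps in (2.4) and (2.5) can be made concrete for a history with finitely many recorded jumps. The sketch below is an illustration only: the function names are hypothetical, and it assumes that the path stays in the last recorded state forever, so that the explosion time is infinite and the absorbing point is never reached.

```python
from bisect import bisect_right
from itertools import accumulate


def jump_epochs(thetas):
    """Return [T_0, T_1, ..., T_k] with T_0 = 0 and T_j = theta_1 + ... + theta_j."""
    return [0.0] + list(accumulate(thetas))


def state_at(states, thetas, t):
    """Evaluate xi_t for a history with finitely many jumps.

    states = [x_0, ..., x_k], thetas = [theta_1, ..., theta_k]; after the
    last recorded jump the path is assumed to stay in x_k.
    """
    if len(states) != len(thetas) + 1:
        raise ValueError("need one more state than sojourn times")
    if t < 0:
        raise ValueError("time must be nonnegative")
    epochs = jump_epochs(thetas)
    # The number of epochs T_j <= t, minus one, is the index k
    # with T_k <= t < T_{k+1}, which is the indicator in (2.5).
    return states[bisect_right(epochs, t) - 1]
```

Because the indicator in (2.5) is closed on the left, the path is right-continuous: at a jump epoch the function already returns the new state.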

Components $\Theta_k$ play the role of inter-jump intervals or sojourn times, $T_k$ 
are the jump epochs, and $X_k$ denotes the state of the process $\{\xi_t, t \geq 0\}$ 
on $[T_k, T_{k+1})$. We do not intend to consider the process after moment $T_\infty$, 
so we view it to be absorbed in state $x_\infty$. Hence, we write $q(\cdot \mid x_\infty, a_\infty) = 0$, 
where $a_\infty$ is an isolated point, and let $A(x_\infty) := \{a_\infty\}$, $A_\infty := A \cup \{a_\infty\}$. 

Let $\mathbb{R}_+^0 := [0,\infty)$, and introduce the integer-valued random measure $\mu^*$ 
on $\mathbb{R}_+^0 \times S$ by 

(2.6) $\mu^*(dt,dx) = \sum_{k \geq 0} I_{\{T_k < \infty\}}\, \delta_{(T_k, x_k)}(dt,dx)$, 

where $\delta_y(\cdot)$ is the Dirac measure concentrated at any point $y$. Then we take 
the right-continuous family of $\sigma$-algebras $\{\mathcal{F}_t\}_{t \geq 0}$ with $\mathcal{F}_t := \sigma\{\mu^*([0,s] \times D), s \in [0,t], D \in \mathcal{B}(S)\}$, 
and let 

$\mathcal{P} := \sigma(B \times \{0\}, C \times (s,\infty) \mid B \in \mathcal{F}_0, C \in \mathcal{F}_{s-}, s > 0)$, 

where $\mathcal{F}_{s-} := \bigvee_{t<s} \mathcal{F}_t$. Then, as in [24, 27, 28], a real-valued function on 
$\Omega \times \mathbb{R}_+^0$ is called predictable if it is measurable with respect to $\mathcal{P}$. 
We next introduce the definition of a policy, which is the same as in [27] 
and a generalization of the corresponding one in [28, 34, 35] for denumerable 
states. 

Definition 2.1. A transition probability $\pi$ from $(\Omega \times \mathbb{R}_+^0, \mathcal{P})$ onto $(A_\infty, \mathcal{B}(A_\infty))$ 
such that $\pi(A(\xi_{t-}(e)) \mid e,t) = 1$ is called a policy, which can be 
randomized and depend on state histories. A policy is called randomized 
stationary if there exists a transition probability $\phi$ from $(S,\mathcal{B}(S))$ onto 
$(A,\mathcal{B}(A))$ such that $\phi(A(x) \mid x) = 1$ and $\pi(da \mid e,t) = I_{\{t < T_\infty\}}(e)\,\phi(da \mid \xi_{t-}(e)) + I_{\{t \geq T_\infty\}}(e)\,\delta_{a_\infty}(da)$. 
We will write such a randomized stationary policy as 
$\phi$. A randomized stationary policy $\phi$ is called (deterministic) stationary if 
there exists a measurable function $f$ from $(S,\mathcal{B}(S))$ onto $(A,\mathcal{B}(A))$ such that 
$\phi(\{f(x)\} \mid x) = 1$. Such a stationary policy will be written as $f$. 

We denote by $\Pi$, $\Pi_s$ and $F$ the classes of all policies, randomized stationary 
policies and stationary policies, respectively. Equivalently, $\Pi_s$ is the 
set of all stochastic kernels $\phi$ on $A$ given $S$ such that $\phi(A(x) \mid x) = 1$ for all 
$x \in S$, and $F$ is the set of all measurable functions $f$ from $S$ to $A$ such that 
$f(x) \in A(x)$ for all $x \in S$. Obviously, $F \subseteq \Pi_s \subseteq \Pi$. 

Remark 2.2. The requirement of predictability of a policy implies that 
at time $t \geq 0$ each policy depends on only the past jump moments $T_0, T_1, \ldots, T_m < t$ 
and the corresponding states $x_0, \ldots, x_m \in S$. This means that a policy 
may depend on state histories. However, the class $\Pi$ is not the complete 
collection of all history-dependent policies. This is because each state history 
$h_k = (x_0,\theta_1,x_1,\ldots,\theta_k,x_k)$ does not include past actions $a_m$ $(0 \leq m < k)$. To 
overcome the shortcoming of the definition of a state history, a possible and 
natural way is to replace $h_k$ with a new history $(x_0,a_0,\theta_1,\ldots,x_{k-1},a_{k-1},\theta_k,x_k)$ 
including past actions. If we do so, some results in [24, 28] such as the structure 
of the probability measure in (2.9) and the predictable properties of 
the random measure $\nu^\pi$ in (2.7) and functions $m(D \mid e,t)$ in (2.8), which 
are required in the following arguments, need to be checked one by one. Since 
these desired results for the case of new histories have not been proven, we 
still use the definition of a policy in Definition 2.1, which is the same as in 
[27, 28, 34, 35], and which is also a generalization of the corresponding one 
in [5, 11, 12, 15, 17] for a Markov policy. 
For each $\pi \in \Pi$, by Definition 2.1 we see that the random measure on 
$\mathbb{R}_+^0 \times S$ given by 

(2.7) $\nu^\pi(e,dt,D) := \left[\int_A \pi(da \mid e,t)\, q(D \mid \xi_{t-}(e),a)\, I_{\{\xi_{t-} \notin D\}}(e)\right] dt$ for $D \in \mathcal{B}(S)$ 

is predictable, and $\nu^\pi(\{t\} \times S) = \nu^\pi([T_\infty,\infty) \times S) = 0$ for all $t \geq 0$. Thus, 
for any initial distribution $\gamma \in \mathcal{P}(S)$, Theorem 4.27 in [28] (or Theorem 3.6 
in [24]) ensures the existence of a unique probability measure $P_\gamma^\pi$ on $(\Omega, \mathcal{F})$ 
such that $P_\gamma^\pi\{x_0 \in dx\} = \gamma(dx)$, and $\nu^\pi$ is a dual predictable projection of 
the measure $\mu^*$ in (2.6). The expectation operator with respect to $P_\gamma^\pi$ is 
denoted by $E_\gamma^\pi$. In particular, $E_\gamma^\pi$ and $P_\gamma^\pi$ will be written as $E_x^\pi$ and $P_x^\pi$, 
respectively, when $\gamma$ is the Dirac measure located at point $x \in S$. 

For any fixed $\pi \in \Pi$ and $\gamma \in \mathcal{P}(S)$, let us recall how the measure $P_\gamma^\pi$ is 
constructed. First, by Definition 2.1 we see that, for each fixed $D \in \mathcal{B}(S)$, 
the following function on $\Omega \times \mathbb{R}_+^0$: 

$m(D \mid e,t) := \int_A \pi(da \mid e,t)\, q(D \mid \xi_{t-}(e),a)\, I_{\{\xi_{t-} \notin D\}}(e)$ 

is predictable, and thus (by Lemma 3.3 in [24]) has the following representation: 

(2.8) $m(D \mid e,t) =: I_{\{0\}}(t)\, m_0(D \mid x_0,0) + \sum_{k=0}^{\infty} I_{\{T_k < t \leq T_{k+1}\}}(e)\, m_k(D \mid h_k(e), t - T_k)$, 

where $m_k(\cdot \mid h_k(e),t)$ (depending on $\pi$) is a measure on $\mathcal{B}(S)$ [for any fixed 
$h_k(e)$ and $t$], $m_k(D \mid h_k(e),t)$ is measurable in $(e,t)$ [for any fixed $D \in \mathcal{B}(S)$] 
and $m_k(\{x_k\} \mid h_k(e),t) = 0$ for all $x_k \in S$ and $t \geq 0$. Let $H_0 = S$, $H_k = S \times (\mathbb{R}_+ \times S_\infty)^k$ 
for $k \geq 1$. Noting that a measure $\gamma$ on $\mathcal{B}(H_0)$ is given, we suppose 
that the measure $P_\gamma^\pi$ on $\mathcal{B}(H_k)$ has been constructed; then $P_\gamma^\pi$ on $\mathcal{B}(H_{k+1})$ 
is determined as follows: 

(2.9) $P_\gamma^\pi(\Gamma \times (dt,dx)) := \int_\Gamma P_\gamma^\pi(dh_k)\, I_{\{\theta_{k+1} < \infty\}}\, m_k(dx \mid h_k,t)\, e^{-\int_0^t m_k(S \mid h_k,v)\,dv}\,dt$; 
      $P_\gamma^\pi(\Gamma \times (\infty,x_\infty)) := \int_\Gamma P_\gamma^\pi(dh_k) \left\{ I_{\{\theta_{k+1} = \infty\}} + I_{\{\theta_{k+1} < \infty\}}\, e^{-\int_0^\infty m_k(S \mid h_k,v)\,dv} \right\}$, 

where $\Gamma \in \mathcal{B}(H_k)$. According to the Ionescu Tulcea theorem in [4], there 
exists a unique probability measure $P_\gamma^\pi$ on $(\Omega, \mathcal{F})$, which has projections 
onto the spaces of $k$-component state histories satisfying relations (2.9). 

For any given $\gamma \in \mathcal{P}(S)$ and $\pi \in \Pi$, using (2.8) and (2.9), we now give a 
somewhat informal description of how the process $\{\xi_t, t \geq 0\}$ evolves. Suppose 
that the process is at state $x_k$ at time $t \in [T_k, T_{k+1})$ $(k \geq 0)$. Then, a 
transition from $x_k$ to a set $D$ of states occurs with probability $m_k(D \mid h_k, t - T_k)\,dt + o(dt)$, 
or the process remains at $x_k$ with probability $1 - m_k(S \mid h_k, t - T_k)\,dt + o(dt)$. 
In the former case, the sojourn time $\Theta_{k+1}$ of $\{\xi_t, t \geq 0\}$ at $x_k$ has a 
distribution with a so-called "density function" $e^{-\int_0^t m_k(S \mid h_k,v)\,dv}$. 
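
For a randomized stationary policy in a finite model, the description above specializes to a simple simulation scheme: the measure $m_k$ no longer depends on the history or on the elapsed time, so each sojourn time is exponential with the policy-averaged exit rate. The following minimal sketch assumes finitely many states and actions; the data layout, function names and example rates are hypothetical. Note that the rates are averaged over the action distribution before sampling, rather than sampling an action first and then using that action's rate, which would give a different sojourn-time law.

```python
import random


def averaged_rates(rates, phi, x):
    """Return y -> sum_a phi(a|x) q({y}|x,a), the rates under the policy phi."""
    out = {}
    for a, prob in phi[x].items():
        for y, rate in rates[(x, a)].items():
            out[y] = out.get(y, 0.0) + prob * rate
    return out


def simulate(rates, phi, x0, horizon, rng):
    """Simulate the jump chain [(T_0, x_0), (T_1, x_1), ...] up to `horizon`."""
    t, x = 0.0, x0
    path = [(t, x)]
    while True:
        row = averaged_rates(rates, phi, x)
        jumps = {y: r for y, r in row.items() if y != x and r > 0.0}
        total = sum(jumps.values())
        if total == 0.0:
            # Absorbing state: no further jumps occur.
            return path
        # Exponential sojourn time with the total exit rate from x.
        t += rng.expovariate(total)
        if t >= horizon:
            return path
        # The next state is chosen proportionally to the averaged rates.
        x = rng.choices(list(jumps), weights=list(jumps.values()))[0]
        path.append((t, x))


# Illustrative data (hypothetical): two actions in state 0, one in state 1.
rates = {
    (0, "slow"): {0: -1.0, 1: 1.0},
    (0, "fast"): {0: -3.0, 1: 3.0},
    (1, "stay"): {0: 2.0, 1: -2.0},
}
phi = {0: {"slow": 0.5, "fast": 0.5}, 1: {"stay": 1.0}}
```

A call such as `simulate(rates, phi, 0, 5.0, random.Random(0))` returns the list of jump epochs and post-jump states before time 5. A general history-dependent policy would require time-inhomogeneous sampling, which this sketch does not attempt.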

As mentioned above, we do not intend to consider the process after moment 
$T_\infty$. Thus, we need to give conditions ensuring the nonexplosion of 
$\{\xi_t, t \geq 0\}$ [i.e., $P_x^\pi(\xi_t \in S) = 1$]. To do so, we consider the following 
condition. 



Assumption A. There exist a continuous function $w \geq 1$ on $S$ and 
constants $\rho, b \geq 0$ and a sequence $\{S_k\}$ of nondecreasing subsets of $S$, such 
that: 

(1) $\int_S w(y)\,q(dy \mid x,a) \leq \rho w(x) + b$ for all $(x,a) \in K$; 

(2) $\inf_{x \notin S_k} w(x) \uparrow +\infty$ as $k \to \infty$, with $\inf \emptyset := \infty$; 

(3) $S_k \uparrow S$ and $\sup_{a \in A(x),\, x \in S_k} |q(\{x\} \mid x,a)| < \infty$ for all $k \geq 1$. 

Remark 2.3. We call Assumption A a nonexplosion condition for $\{\xi_t, t \geq 0\}$. 
Obviously, Assumption A trivially holds when the transition rates are 
bounded; see [18, 26, 27, 30, 34, 37, 39, 40], for instance. Assumption A is 
similar to those in [5, 11, 12, 15, 17] for Markov policies and unbounded 
transition rates, and it can be verified with examples in [5, 11, 12, 15, 17] 
and those below. 
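
As a purely illustrative check of the drift condition in Assumption A(1) with unbounded rates, consider a hypothetical controlled birth-death model on the nonnegative integers (not one of the examples of this paper): births occur at rate `lam * x + 1`, deaths at rate `a * x` for an action `a`, and the weight is $w(x) = 1 + x$. The drift then equals the birth rate minus the death rate, so the condition holds with $\rho$ equal to `lam` and $b = 1$. The sketch below spot-checks the inequality numerically on a finite range of states; it is not a proof.

```python
def drift(w, birth, death, x):
    """Return sum_y w(y) q({y}|x,a) for a birth-death row at state x."""
    down = death if x > 0 else 0.0  # no deaths are possible at state 0
    value = birth * w(x + 1) - (birth + down) * w(x)
    if x > 0:
        value += down * w(x - 1)
    return value


def satisfies_drift_condition(lam, actions, rho, b, max_state, tol=1e-9):
    """Spot-check Assumption A(1) on the states 0, ..., max_state."""
    def w(x):
        return 1.0 + x

    for x in range(max_state + 1):
        for a in actions:
            birth = lam * x + 1.0   # hypothetical birth rate
            death = a * x           # hypothetical controlled death rate
            if drift(w, birth, death, x) > rho * w(x) + b + tol:
                return False
    return True
```

With the sets $S_k = \{0, \ldots, k\}$, parts (2) and (3) of Assumption A also hold in this hypothetical model, since $w$ grows without bound off $S_k$ and the exit rates are bounded on each $S_k$ when the action set is bounded.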



Under Assumption A, we see (by Theorem 3.1 below) that $\{\xi_t, t \geq 0\}$ 
is nonexplosive. Thus, for any fixed discount factor $\alpha > 0$ and an initial 
distribution $\gamma \in \mathcal{P}(S)$, we define the expected discounted criteria 

(2.10) $V_\alpha(x,\pi,u) := \int_0^\infty e^{-\alpha t}\, E_x^\pi\left[\int_A u(\xi_{t-},a)\,\pi(da \mid e,t)\right] dt$, 
       $V_\alpha(\pi,u) := \int_S V_\alpha(x,\pi,u)\,\gamma(dx)$ 

for each $\pi \in \Pi$, $x \in S$ and a measurable function $u$ on $K$, provided the 
integrals in (2.10) are well defined. 
In particular, let 

$V_r(x,\pi) := V_\alpha(x,\pi,r)$, $V_r(\pi) := V_\alpha(\pi,r)$ 

and 

$V_n(x,\pi) := V_\alpha(x,\pi,c_n)$, $V_n(\pi) := V_\alpha(\pi,c_n)$ for $n = 1,\ldots,N$. 

[The finiteness of $V_r(\pi)$ and $V_n(\pi)$ will be ensured in Theorem 3.3 below.] 
Let 

(2.11) $U := \{\pi \mid V_n(\pi) \leq d_n, n = 1,\ldots,N\}$ and $V_r(U) := \sup_{\pi \in U} V_r(\pi)$ 

be the set of constrained policies and the constrained optimal reward value, 
respectively. 

In the following arguments, we assume that the set $U$ is not empty, and 
the discount factor $\alpha$ and the initial distribution $\gamma$ as well as the numbers 
$d_n$ are fixed. 

Then, the constrained optimality problem under consideration is as fol- 
lows: 

(2.12) Maximize $V_r(\pi)$ over all $\pi \in U$. 

Definition 2.4. A policy $\pi^* \in U$ is said to be constrained optimal if 
$V_r(\pi^*) = V_r(U)$. When $U = \Pi$, a constrained optimal policy is said to be 
unconstrained optimal. 

The main goal of this paper is to give the conditions for the existence and 
solvability of a constrained/unconstrained optimal policy. 

3. Main results. We state the main results of our work in this section. 
Their proofs are presented later in Section 5. The main results are given in 
three subsections. 

3.1. Conditions for nonexplosion and finiteness. This subsection states 
the results on the nonexplosion of $\{\xi_t, t \geq 0\}$ and finiteness of $V_n(x,\pi)$ and 
$V_n(\pi)$. 

For the nonexplosion of $\{\xi_t, t \geq 0\}$, we have the following fact. 

Theorem 3.1. Suppose that Assumption A holds. Then, for each $\pi \in \Pi$, 
$x \in S$ and $t \geq 0$: 

(a) $P_x^\pi(T_\infty = \infty) = 1$ and $P_x^\pi(\xi_t \in S) = 1$; 

(b) $E_x^\pi[w(\xi_t)] \leq \begin{cases} e^{\rho t} w(x) + \frac{b}{\rho}(e^{\rho t} - 1), & \text{if } \rho \neq 0, \\ w(x) + bt, & \text{if } \rho = 0; \end{cases}$ 

(c) The analog of the forward Kolmogorov equation holds: 

$P_x^\pi(\xi_t \in D) = I_D(x) + E_x^\pi\left[\int_0^t \int_A \pi(da \mid e,s)\, q(D \mid \xi_{s-}(e),a)\, ds\right]$ 

for each $D \in \mathcal{B}(S)$ with $\sup_{x \in D} q^*(x) < \infty$. 

The proof of Theorem 3.1 appears in Section 5. 

Remark 3.2. Theorem 3.1(a) establishes the nonexplosion of $\{\xi_t, t \geq 0\}$ 
on the probability space $(\Omega, \mathcal{F}, P_x^\pi)$ (for each policy $\pi \in \Pi$ and $x \in S$), and 
Theorem 3.1 is an extension of the corresponding results in [18, 26, 27, 30, 
34, 35, 37, 39, 40] for bounded transition rates and in [5, 11-17, 31] for 
Markov policies only. The process $\{\xi_t, t \geq 0\}$ may not be Markovian because 
a policy $\pi$ can depend on state histories. 

Inspired by Theorem 3.1, we introduce the following condition. 

Assumption B. Let $c_0(x,a) := -r(x,a)$ for $(x,a) \in K$, and $w$ be as in 
Assumption A. 

(1) There exists a constant $M > 0$ such that $|c_n(x,a)| \leq M w(x)$ for every 
$(x,a) \in K$ and $n = 0, 1, \ldots, N$. 

(2) The discount factor $\alpha$ satisfies $\alpha > \rho$, with $\rho$ as in Assumption A. 

(3) $\int_S w(x)\,\gamma(dx) < \infty$. 

Then the following fact establishes the finiteness of $V_n(x,\pi)$ and $V_n(\pi)$. 

Theorem 3.3. Suppose that Assumptions A and B hold. Then, for each 
$\pi \in \Pi$ and $x \in S$: 

(a) $E_x^\pi\left[\int_A |c_n(\xi_t,a)|\,\pi(da \mid e,t)\right] \leq M E_x^\pi[w(\xi_t)]$ for all $t \geq 0$ and $n = 0, 1, \ldots, N$; 

(b) $|V_n(x,\pi)| \leq M[\alpha w(x) + b]/[\alpha(\alpha - \rho)]$ and $|V_n(\pi)| \leq M M_1$ for $n = 0, 1, \ldots, N$, 
where $V_0(x,\pi) := V_\alpha(x,\pi,c_0)$, $V_0(\pi) := V_\alpha(\pi,c_0)$, and 
$M_1 := [\alpha \int_S w(x)\,\gamma(dx) + b]/[\alpha(\alpha - \rho)]$. 

Proof. Obviously, this theorem follows from Theorem 3.1(b) and (2.10). 

□ 
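
The computation behind part (b) can be spelled out as follows. This is a sketch of the missing step, assuming that the bound in Theorem 3.1(b) reads $E_x^\pi[w(\xi_t)] \leq e^{\rho t} w(x) + \frac{b}{\rho}(e^{\rho t} - 1)$ for $\rho \neq 0$ and $E_x^\pi[w(\xi_t)] \leq w(x) + bt$ for $\rho = 0$, and using part (a) together with $\alpha > \rho$ from Assumption B(2).

```latex
\begin{align*}
|V_n(x,\pi)|
  &\leq M \int_0^\infty e^{-\alpha t}\, E_x^\pi[w(\xi_t)]\, dt \\
  &\leq M \int_0^\infty e^{-\alpha t}
      \Bigl[ e^{\rho t} w(x) + \tfrac{b}{\rho}\bigl(e^{\rho t} - 1\bigr) \Bigr] dt
      \qquad (\rho \neq 0) \\
  &= M \Bigl[ \frac{w(x)}{\alpha - \rho}
      + \frac{b}{\rho}\Bigl( \frac{1}{\alpha - \rho} - \frac{1}{\alpha} \Bigr) \Bigr]
   = M \Bigl[ \frac{w(x)}{\alpha - \rho} + \frac{b}{\alpha(\alpha - \rho)} \Bigr]
   = \frac{M[\alpha w(x) + b]}{\alpha(\alpha - \rho)} . \\
\intertext{For $\rho = 0$ the same bound follows from}
  \int_0^\infty e^{-\alpha t}\,[\, w(x) + bt \,]\, dt
  &= \frac{w(x)}{\alpha} + \frac{b}{\alpha^2}
   = \frac{\alpha w(x) + b}{\alpha^2} .
\end{align*}
```

Integrating the resulting bound with respect to $\gamma$ and using Assumption B(3) gives $|V_n(\pi)| \leq M M_1$.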

3.2. Existence of constrained optimal policies. This subsection states the 
main results on the existence of constrained optimal policies. 

In order to show the existence of a constrained optimal policy, as in [1, 2, 
19-21, 29, 33, 35], we introduce a key concept of an occupation measure of 
a policy. 

Definition 3.4. Fix policies $\pi, \pi_1, \pi_2 \in \Pi$. 

(i) The occupation measure of $\pi$ is a probability measure $\eta^\pi$ on $S \times A$ 
concentrated on $K$, which is defined by 

(3.1) $\eta^\pi(D \times \Gamma) := \alpha \int_0^\infty e^{-\alpha t}\, E_\gamma^\pi\left[I_D(\xi_{t-})\,\pi(\Gamma \mid e,t)\right] dt$ 

with $D \in \mathcal{B}(S)$, $\Gamma \in \mathcal{B}(A)$. 

(Obviously, $\eta^\pi$ concentrates on $K$ and depends on $\pi$, $\alpha$ and $\gamma$. However, we 
suppress $\gamma$ and $\alpha$ in the occupation measure for simplicity.) 

(ii) Two policies $\pi_1$ and $\pi_2$ are called equivalent if $\eta^{\pi_1} = \eta^{\pi_2}$. 

(iii) We denote by $\hat{\eta}$ the marginal (or projection) on $S$ of a probability 
measure $\eta$ on $S \times A$, and by $\phi^\eta$ $(\in \Pi_s)$ the randomized stationary policy 
(depending on $\eta$), which is determined by the following decomposition of $\eta$: 

(3.2) $\eta(dx,da) = \hat{\eta}(dx)\,\phi^\eta(da \mid x)$. 

Thus, by (3.1) and (2.10), we have $V_\alpha(\pi,u) = \frac{1}{\alpha} \int_{S \times A} u(x,a)\,\eta^\pi(dx,da)$, 
and we can rewrite (2.12) as an equivalent optimality problem: 

(3.3) Maximize $\frac{1}{\alpha} \int_K r(x,a)\,\eta(dx,da)$ 
      over $\eta \in \left\{\eta^\pi : \int_K c_n(x,a)\,\eta^\pi(dx,da) \leq \alpha d_n,\ 1 \leq n \leq N\right\}$. 

To solve problem (3.3), we need to seek a certain compactness structure on 
the set of all occupation measures. To do so, we need to characterize an 
occupation measure, and we have the following fact. 

Theorem 3.5. Under Assumption A, the following assertions hold. 

(a) The occupation measure $\eta^\pi$ (for each fixed $\pi \in \Pi$) satisfies the following 
equation: 

$\alpha\hat{\eta}^\pi(D) = \alpha\gamma(D) + \int_{S \times A} q(D \mid x,a)\,\eta^\pi(dx,da)$ 
$\forall D \in \mathcal{B}(S)$ with $\sup_{x \in D} q^*(x) < \infty$. 

(b) Conversely, if a probability measure $\eta$ on $S \times A$ (concentrated on $K$) 
satisfies 

$\alpha\hat{\eta}(D) = \alpha\gamma(D) + \int_{S \times A} q(D \mid x,a)\,\eta(dx,da)$ 
$\forall D \in \mathcal{B}(S)$ with $\sup_{x \in D} q^*(x) < \infty$ 

and $\int_S |q(\{x\} \mid x,\phi^\eta)|\,\hat{\eta}(dx) < \infty$, then $\eta^{\phi^\eta} = \eta$, where $\phi^\eta$ is as in (3.2). 

(c) If, in addition, Assumptions B(2) and B(3) are satisfied, and $q^*(x) \leq L w(x)$ 
for all $x \in S$, with some constant $L > 0$, then $\phi^{\eta^\phi} = \phi$ for all $\phi \in \Pi_s$. 

The proof of Theorem 3.5 appears in Section 5. 

Remark 3.6. Theorems 3.5(a) and 3.5(b) are proved in [35] for continuous- 
time MDPs with uniformly bounded transition rates and in [1, 2, 21] for 
discrete-time MDPs. 
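
In a finite model the characterization in Theorem 3.5 becomes a linear system: for a randomized stationary policy, the state marginal of the occupation measure solves the balance equation of part (a) with the policy-averaged rates, and the occupation measure is recovered through the decomposition (3.2). The sketch below illustrates this under the assumption of finitely many states and actions; the nested-list layout `Q[x][a][y]`, the helper names and the small pure-Python solver are hypothetical choices for this illustration.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for i in range(n):
            if i != c:
                f = M[i][c]
                M[i] = [vi - f * vc for vi, vc in zip(M[i], M[c])]
    return [M[i][n] for i in range(n)]


def occupation_measure(Q, phi, gamma, alpha):
    """Return eta[x][a] for the stationary policy phi in a finite model.

    Q[x][a][y] = q({y}|x,a), phi[x][a] = phi(a|x), gamma[x] = initial law.
    """
    n = len(Q)
    # Policy-averaged rates q({y}|x,phi).
    Qphi = [[sum(phi[x][a] * Q[x][a][y] for a in range(len(phi[x])))
             for y in range(n)] for x in range(n)]
    # Balance: alpha*hat(y) - sum_x hat(x) q({y}|x,phi) = alpha*gamma(y).
    A = [[(alpha if x == y else 0.0) - Qphi[x][y] for x in range(n)]
         for y in range(n)]
    hat = solve(A, [alpha * g for g in gamma])
    return [[hat[x] * phi[x][a] for a in range(len(phi[x]))] for x in range(n)]


def discounted_value(eta, u, alpha):
    """Return (1/alpha) * sum_{x,a} u(x,a) eta(x,a)."""
    return sum(u[x][a] * eta[x][a]
               for x in range(len(eta)) for a in range(len(eta[x]))) / alpha


# Illustrative data (hypothetical): two states, two actions per state.
Q = [[[-1.0, 1.0], [-3.0, 3.0]], [[2.0, -2.0], [0.5, -0.5]]]
phi = [[0.5, 0.5], [1.0, 0.0]]
eta = occupation_measure(Q, phi, gamma=[1.0, 0.0], alpha=1.0)
```

For these data the state marginal is approximately (0.6, 0.4), and `discounted_value(eta, [[1.0, 0.0], [2.0, 5.0]], 1.0)` is approximately 1.1, which agrees with solving the discounted evaluation equation for this policy directly.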



To give a certain convergence of occupation measures, we introduce some 
notation. 

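For any real-valued continuous function $w \geq 1$ on $S$, let 

$\mathcal{P}_w(S \times A) := \left\{\eta \in \mathcal{P}(S \times A) \mid \int_S w(x)\,\hat{\eta}(dx) < \infty\right\}$. 

Then we define two maps, $T_w$ and $T^w$, as follows: 

$T_w : \mathcal{P}_w(S \times A) \to \mathcal{P}(S \times A)$, $\eta \mapsto T_w(\eta)$, 
where $T_w(\eta)$ is given by 

(3.4) $T_w(\eta)(D \times \Gamma) := \dfrac{\int_D w(x)\,\eta(dx,\Gamma)}{\int_S w(x)\,\hat{\eta}(dx)}$ $\forall D \in \mathcal{B}(S)$ and $\Gamma \in \mathcal{B}(A)$; 

$T^w : \mathcal{P}(S \times A) \to \mathcal{P}_w(S \times A)$, $\mu \mapsto T^w(\mu)$, 
where $T^w(\mu)$ is given by 

(3.5) $T^w(\mu)(D \times \Gamma) := \dfrac{\int_D \frac{1}{w(x)}\,\mu(dx,\Gamma)}{\int_S \frac{1}{w(x)}\,\hat{\mu}(dx)}$ $\forall D \in \mathcal{B}(S)$ and $\Gamma \in \mathcal{B}(A)$. 

[Since $1 \leq w < \infty$ on $S$, we have $0 < \int_S \frac{1}{w(x)}\,\mu(dx) \leq 1$ for any $\mu \in \mathcal{P}(S)$, and 
thus the maps $T_w$ and $T^w$ are well defined.] 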
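
For measures with finite support, the two maps in (3.4) and (3.5) amount to reweighting each atom by the weight function (or by its reciprocal) and renormalizing. The following minimal sketch illustrates this and lets one check numerically that the two maps undo each other; the dictionary representation of a measure and the function names are hypothetical.

```python
def reweight_by_w(eta, w):
    """Map (3.4): multiply each atom by w(x) and renormalize.

    eta is a dict (x, a) -> mass, a finitely supported probability measure.
    """
    total = sum(w(x) * mass for (x, _a), mass in eta.items())
    return {(x, a): w(x) * mass / total for (x, a), mass in eta.items()}


def reweight_by_inverse_w(mu, w):
    """Map (3.5): divide each atom by w(x) and renormalize."""
    total = sum(mass / w(x) for (x, _a), mass in mu.items())
    return {(x, a): mass / (w(x) * total) for (x, a), mass in mu.items()}


def weight(x):
    """Illustrative weight function w(x) = 1 + x on the nonnegative integers."""
    return 1.0 + x


eta_example = {(0, "a"): 0.5, (1, "a"): 0.25, (1, "b"): 0.25}
mu_example = reweight_by_w(eta_example, weight)
round_trip = reweight_by_inverse_w(mu_example, weight)
```

Here `mu_example` puts mass 1/3 on each of the three atoms, and `round_trip` recovers `eta_example` up to rounding.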

Definition 3.7. The $w$-weak topology on $\mathcal{P}_w(S \times A)$ is defined by the 
$w$-weak convergence as follows: a sequence $\{\eta_k, k \geq 1\} \subseteq \mathcal{P}_w(S \times A)$ is said 
to $w$-converge weakly to $\eta \in \mathcal{P}_w(S \times A)$ (written as $\eta_k \xrightarrow{w} \eta$) if 

$\lim_{k \to \infty} \int_{S \times A} u(x,a)\,\eta_k(dx,da) = \int_{S \times A} u(x,a)\,\eta(dx,da)$ 

for each continuous function $u(x,a)$ on $S \times A$ such that $|u(x,a)| \leq L_u w(x)$ 
for all $(x,a) \in S \times A$, with some nonnegative constant $L_u$ depending on $u$. 

Obviously, $\eta_k \xrightarrow{w} \eta$ implies $\eta_k \to \eta$ (the standard weak convergence of 
probability measures). The following lemma establishes the relationship 
between $w$-weak and standard weak convergence. 

Lemma 3.8. For any given real-valued continuous function $w \geq 1$ on $S$, 
let $\{\eta_k, k = 0, 1, \ldots\} \subseteq \mathcal{P}_w(S \times A)$ and $\{\mu_k, k = 0, 1, \ldots\} \subseteq \mathcal{P}(S \times A)$. Then: 

(a) $T_w(\eta) \in \mathcal{P}(S \times A)$ for all $\eta \in \mathcal{P}_w(S \times A)$ and $T^w(\mu) \in \mathcal{P}_w(S \times A)$ for 
all $\mu \in \mathcal{P}(S \times A)$; 

(b) $T^w(T_w(\eta)) = \eta$ for all $\eta \in \mathcal{P}_w(S \times A)$ and $T_w(T^w(\mu)) = \mu$ for all $\mu \in \mathcal{P}(S \times A)$; 

(c) $\eta_k \xrightarrow{w} \eta_0$ if and only if $T_w(\eta_k) \to T_w(\eta_0)$; 

(d) $\mu_k \to \mu_0$ if and only if $T^w(\mu_k) \xrightarrow{w} T^w(\mu_0)$. 

The proof of Lemma 3.8 appears in Section 5. 
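
To see why the growth bound on the test functions matters, the following worked example (an illustration with assumed data, not taken from this paper) uses $S = \{0, 1, 2, \ldots\}$, a single action, and $w(x) = 1 + x$. It exhibits a sequence that converges weakly in the standard sense but not $w$-weakly, and whose reweighted images under the map in (3.4) do not converge to the reweighted limit, in line with Lemma 3.8(c).

```latex
\begin{align*}
\eta_k &:= \Bigl(1 - \tfrac{1}{k}\Bigr)\delta_0 + \tfrac{1}{k}\,\delta_k ,
  \qquad k \geq 1 . \\
\intertext{For every bounded (continuous) $u$ on $S$,}
\int u \, d\eta_k &= \Bigl(1 - \tfrac{1}{k}\Bigr) u(0) + \tfrac{1}{k}\, u(k)
  \;\longrightarrow\; u(0) = \int u \, d\delta_0 ,
  \qquad\text{so } \eta_k \to \delta_0 . \\
\intertext{For the admissible test function $u = w$, however,}
\int w \, d\eta_k &= \Bigl(1 - \tfrac{1}{k}\Bigr)\cdot 1 + \tfrac{1}{k}\,(1 + k) = 2
  \;\neq\; 1 = \int w \, d\delta_0 ,
  \qquad\text{so } \eta_k \not\xrightarrow{\,w\,} \delta_0 . \\
\intertext{Reweighting by $w$ and normalizing by $\int w \, d\eta_k = 2$ gives}
T_w(\eta_k) &= \tfrac{1}{2}\Bigl[\Bigl(1 - \tfrac{1}{k}\Bigr)\delta_0
  + \Bigl(1 + \tfrac{1}{k}\Bigr)\delta_k\Bigr],
  \qquad T_w(\eta_k)(\{0\}) \to \tfrac{1}{2} \neq 1 = T_w(\delta_0)(\{0\}) .
\end{align*}
```

Half of the reweighted mass escapes to infinity, so the reweighted sequence is not even tight, which is the mechanism that a moment condition such as Assumption C(2) is designed to exclude.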

To further analyze the properties of occupation measures, we let 

(3.6) $\mathcal{M}_o := \left\{\eta^\pi \mid \int_S w(x)\,\hat{\eta}^\pi(dx) < \infty,\ \pi \in \Pi\right\} \subseteq \mathcal{P}_w(K)$ 
      (with $w$ as in Assumption A), 

(3.7) $\mathcal{M}_o^c := \left\{\eta \in \mathcal{M}_o \mid \int_{S \times A} c_n(x,a)\,\eta(dx,da) \leq \alpha d_n,\ n = 1, \ldots, N\right\}$. 

Lemma 3.9. Suppose that Assumptions A, B(2) and B(3) hold. If, in 
addition, $q^*(x) \leq L w(x)$ for all $x \in S$, with some constant $L > 0$, then the 
following assertions hold: 

(a) $\mathcal{M}_o$ and $\mathcal{M}_o^c$ are convex. 

(b) If, in addition, $\int_S g(y)\,q(dy \mid x,a)$ is continuous on $K$ for each fixed 
$g \in C_b(S)$, then $\mathcal{M}_o$ is closed (with respect to the $w$-weak topology). 

The proof of Lemma 3.9 appears in Section 5. 

For the solvability of (3.3), by Lemmas 3.8 and 3.9, we introduce the 
following condition. 

Assumption C. Let $w$ be as in Assumption A. 

(1) The functions $c_n(x,a)$ and $\int_S g(y)\,q(dy \mid x,a)$ are continuous on $K$ [for 
each fixed $g \in C_b(S)$ and $0 \leq n \leq N$]. 

(2) There exist a measurable function $w' \geq 1$ on $S$ and a nondecreasing 
sequence of compact sets $K_m \uparrow K$, such that $\lim_{m \to \infty} \inf_{(x,a) \notin K_m} \frac{w(x)}{w'(x)} = \infty$. 

(3) There exists a constant $L > 0$ such that $q^*(x) \leq L w(x)$ for all $x \in S$. 

Remark 3.10. Assumption C(2) is slightly different from the compactness 
condition in [19-22, 29] for discrete-time MDPs and [12, 16] for 
continuous-time MDPs. 

We now state our second main result on the existence of a constrained 
optimal policy. 

Theorem 3.11. Suppose that Assumptions A, B and C hold. Then: 

(a) $\mathcal{M}_o$ and $\mathcal{M}_o^c$ are metrizable and compact (with respect to the $w'$-weak 
topology), that is, for any sequence $\{\eta_k, k \geq 1\}$ in $\mathcal{M}_o$ (or $\mathcal{M}_o^c$), there 
exists a subsequence $\{\eta_{k_m}, m \geq 1\}$ and $\eta_0 \in \mathcal{M}_o$ (or $\mathcal{M}_o^c$) such that 
$\eta_{k_m} \xrightarrow{w'} \eta_0$. 

(b) There exists a constrained optimal policy. 
The proof of Theorem 3.11 appears in Section 5. 

Remark 3.12. Theorem 3.11(b) shows the existence of a constrained 
optimal policy. It should be noted that the conditions for Theorem 3.11(b) 
are weaker than those in [12-15, 37] for the class of all Markov policies. This 
is because some assumptions such as the nonnegativity of costs in [13] and 
the absolute integrability condition in [12, 13, 15] are not required here. 

3.3. Solvability of constrained optimal policies. This subsection states 
the results on the solvability of constrained optimal policies. 

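First, by (3.3) we see that the original constrained optimality problem 
(2.12) is equivalent to the following constrained minimization problem: 

(3.8) Minimize $V_0(\pi)$ over $\pi \in \{\pi \mid V_n(\pi) \leq d_n,\ n = 1, \ldots, N\}$. 

By (2.10) and (3.1), the problem (3.8) can be rewritten into the following 
form: 

Minimize $\frac{1}{\alpha}\int_K c_0(x,a)\,\eta^\pi(dx,da)$ over $\pi \in \Pi$ with $\int_K c_n(x,a)\,\eta^\pi(dx,da) \leq \alpha d_n$, $n = 1, \ldots, N$, 

which (by Theorem 3.5) is equivalent to the following linear program (LP): 

(3.9) Minimize $\int_K c_0(x,a)\,\eta(dx,da)$ 
      subject to 
      $\int_K c_n(x,a)\,\eta(dx,da) \leq \alpha d_n$, $n = 1, \ldots, N$, 
      $\alpha\hat{\eta}(D) = \alpha\gamma(D) + \int_K q(D \mid x,a)\,\eta(dx,da)$ 
      for all $D \in \mathcal{B}(S)$ with $\sup_{x \in D} q^*(x) < \infty$, 
      $\eta \in \mathcal{P}_w(K)$. 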
Obviously, (3.9) is a linear program over the set of probability measures 
$\eta \in \mathcal{P}(K)$ satisfying (3.9'). We call (3.9) the primal linear programming 
formulation of (2.12). 

Thus, we obtain the following result on the solvability of constrained 
optimal policies. 

Theorem 3.13. Under Assumptions A, B and C(3), the following as- 
sertions hold. 

(a) If there exists a feasible solution to LP (3.9), then the set $U$ of constrained 
policies is nonempty. Conversely, if $U$ is nonempty, then there exists 
a feasible solution to LP (3.9). 

(b) If there exists an optimal solution $\eta^*$ to LP (3.9), then the randomized 
stationary policy $\phi^{\eta^*}$ is constrained optimal. Conversely, if $\pi^*$ is constrained 
optimal, then $\eta^{\pi^*}$ is an optimal solution to LP (3.9). 

(c) If, in addition, $U \neq \emptyset$ and Assumptions C(1) and C(2) are satisfied, 
then an optimal solution $\eta^*$ to LP (3.9) exists, and the policy $\phi^{\eta^*}$ is 
constrained optimal. 

The proof of Theorem 3.13 appears in Section 5. 

In particular, when $S$ and $A(x)$ are finite, LP (3.9) takes the form 

(3.10) Minimize $\sum_{x \in S} \sum_{a \in A(x)} c_0(x,a)\,\eta(x,a)$ 
       subject to 
       $\sum_{x \in S} \sum_{a \in A(x)} c_n(x,a)\,\eta(x,a) \leq \alpha d_n$, $n = 1, \ldots, N$, 
       $\alpha \sum_{a \in A(x)} \eta(x,a) = \alpha\gamma(x) + \sum_{y \in S} \sum_{a \in A(y)} q(\{x\} \mid y,a)\,\eta(y,a)$ $\forall x \in S$, 
       $\eta(x,a) \geq 0$, $x \in S$, $a \in A(x)$, 

which is an LP and can be solved by many methods such as the well-known 
simplex method. 

To state the structure of constrained optimal policies, we need to recall 
some concepts. We say that under $\phi \in \Pi_s$, there are $m(x,\phi)$ randomizations 
at $x \in S$ if there are $m(x,\phi) + 1$ actions $a \in A(x)$ for which $\phi(a \mid x) > 0$. 
When $S$ and $A(x)$ are finite, we call $\#(\phi) := \sum_{x \in S} m(x,\phi)$ the number of 
randomizations under $\phi$. 

Thus, following Theorem 3.8 in [1] and Theorem 3.13 above, we have the 
following fact. 
Corollary 3.14. Suppose that $S$ and $A(x)$ are finite. Let $\eta^*$ be an optimal basic solution to LP (3.10). Then, the policy $\phi^{\eta^*}$ is constrained optimal, where $\phi^{\eta^*}$ is given by

(3.11)
$$\phi^{\eta^*}(a|x) := \begin{cases} \dfrac{\eta^*(x,a)}{\hat\eta^*(x)}, & \text{when } \hat\eta^*(x) := \sum_{a\in A(x)}\eta^*(x,a) > 0 \text{ and } a\in A(x),\\[2mm] I_{\{a(x)\}}(a), & \text{when } \hat\eta^*(x) = 0 \text{ and } a\in A(x), \end{cases}$$

for all $x \in S$, where $a(x) \in A(x)$ is chosen arbitrarily. Further, $\#(\phi^{\eta^*}) \le N$.
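
As an illustration of the finite case, the following minimal sketch (not part of the original argument) computes the occupation measure of a given randomized stationary policy from the balance equation in Theorem 3.5(a), then recovers the policy through (3.11) and counts its randomizations. The two-state model, the rates `Q`, the policy `PHI` and all function names are hypothetical choices made only for this example.

```python
from fractions import Fraction as F

# Hypothetical two-state model (all numbers are illustrative assumptions).
ALPHA = F(1)                           # discount factor alpha
GAMMA = [F(1, 2), F(1, 2)]             # initial distribution gamma
ACTIONS = {0: ["a", "b"], 1: ["a"]}    # admissible action sets A(x)
# Q[(x, a)][y]: transition rate from x to y under a; every row sums to zero.
Q = {(0, "a"): [F(-1), F(1)],
     (0, "b"): [F(-3), F(3)],
     (1, "a"): [F(2), F(-2)]}
# A randomized stationary policy phi(a|x).
PHI = {0: {"a": F(1, 4), "b": F(3, 4)}, 1: {"a": F(1)}}


def occupation_measure(phi):
    """Solve alpha*eta_hat(x) = alpha*gamma(x) + sum_y eta_hat(y) q({x}|y,phi)."""
    # q_phi[y][x] = sum_a q({x}|y,a) phi(a|y)
    q_phi = [[sum(Q[(y, a)][x] * phi[y][a] for a in ACTIONS[y]) for x in (0, 1)]
             for y in (0, 1)]
    # 2x2 linear system m * eta_hat = alpha*gamma, solved by Cramer's rule.
    m = [[(ALPHA if x == y else 0) - q_phi[y][x] for y in (0, 1)] for x in (0, 1)]
    rhs = [ALPHA * g for g in GAMMA]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    eta_hat = [(rhs[0] * m[1][1] - m[0][1] * rhs[1]) / det,
               (m[0][0] * rhs[1] - m[1][0] * rhs[0]) / det]
    # eta(x, a) = eta_hat(x) * phi(a|x)
    return {(x, a): eta_hat[x] * phi[x][a] for x in (0, 1) for a in ACTIONS[x]}


def policy_from_measure(eta):
    """Recover phi via (3.11); a state with zero mass gets an arbitrary action."""
    phi = {}
    for x, acts in ACTIONS.items():
        mass = sum(eta[(x, a)] for a in acts)
        if mass > 0:
            phi[x] = {a: eta[(x, a)] / mass for a in acts}
        else:
            phi[x] = {a: F(int(a == acts[0])) for a in acts}
    return phi


def randomizations(phi):
    """#(phi): sum over x of (number of actions with positive probability minus 1)."""
    return sum(sum(1 for prob in phi[x].values() if prob > 0) - 1 for x in phi)


eta = occupation_measure(PHI)
recovered = policy_from_measure(eta)
```

Exact rational arithmetic is used so that the round trip from policy to measure and back can be compared by equality; for larger models a numerical LP solver applied to (3.10) would be the natural replacement.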

Corollary 3.14 provides the structure of a constrained optimal policy for 
finite S and A(x), and it is proven for the case of denumerable states and 
a single constraint in [13, 42]. For a more general case of Polish spaces, we 
have the following facts, in which the first one (i.e., Theorem 3.15) establishes
the relationship between stationary policies in $F$ and extreme points in $\mathcal{M}_o$,
and the second one (i.e., Theorem 3.16) shows a structure of a constrained
optimal policy.

Theorem 3.15. Suppose that Assumptions A, B(2), B(3) and C(3) hold. Then:

(a) $\eta^f$ is an extreme point in $\mathcal{M}_o$ for each $f \in F$.

(b) If, for each $\phi \in \Pi_s$ and $D \in \mathcal{B}(S)$ with $\hat\eta^\phi(D) > 0$, there exists a state $x \in D$ (depending on $D$ and $\phi$) such that $\hat\eta^\phi(\{x\}) > 0$, then $\eta$ is an extreme point in $\mathcal{M}_o$ if and only if there exists a policy $f \in F$ such that $\eta = \eta^f$.

[The condition in Theorem 3.15(b) is satisfied when $S$ is denumerable.]
The proof of Theorem 3.15 appears in Section 5.

Theorem 3.16. Suppose that Assumptions A, B, C and the conditions for Theorem 3.15(b) are satisfied. Then, there exists a constrained optimal policy $\pi^* \in \Pi_s$, which is a mixture of $(N+1)$ stationary policies, that is, there exist $(N+1)$ numbers $p_n \ge 0$ and policies $f_n \in F$ ($1 \le n \le N+1$) such that $\pi^* = \phi^{(p_1\eta^{f_1} + \cdots + p_{N+1}\eta^{f_{N+1}})}$ and $p_1 + \cdots + p_{N+1} = 1$.

The proof of Theorem 3.16 appears in Section 5. 

Remark 3.17. The arguments of Theorems 3.11, 3.13, 3.15 and 3.16 do 
not depend on the data in model (2.1), but they are based on Theorem 3.5. 
Thus, the discrete-time versions of Theorems 3.11, 3.13, 3.15 and 3.16 are 
still true because Theorem 3.5 is established in [1, 2, 21] for discrete-time 
MDPs. 
4. Examples. In this section, we illustrate our conditions and main results with examples.

Example 4.1. Let $S := (-\infty,\infty)$, $A(x) := [\beta_0, \beta(|x|+1)]$ for each $x \in S$ with some constants $0 < \beta_0 < \beta$. Suppose that the reward $r(x,a)$ and costs $c_n(x,a)$ ($1 \le n \le N$) are given. We consider the transition rates $q(\cdot|x,a)$ given by

(4.1)  $q(D|x,a) := (|x|+1)\left[\int_{D-\{x\}} f(y|x,a)\,dy - \delta_x(D)\right]$  for $(x,a) \in K$, $D \in \mathcal{B}(S)$,

where $f(y|x,a) := \frac{1}{\sqrt{2\pi a}}\,e^{-(y-x)^2/(2a)}$ is the density function of the Gaussian distribution $N(x,a)$.

We now aim to find conditions that ensure the existence of constrained 
optimal policies for Example 4.1. To do so, we need the following hypotheses. 



Assumption D. Let $\alpha$, $\gamma$, $d_n$ and $U$ ($\ne \emptyset$) be as in (2.11).

(1) $\alpha > 6\beta$ and $\int_S x^4\,\gamma(dx) < \infty$ (hence, there exists a constant $\rho$ such that $6\beta < \rho < \alpha$);

(2) $c_n(x,a)$ ($0 \le n \le N$) are continuous on $K$ and $|c_n(x,a)| \le L'(x^2+1)$ for all $(x,a) \in K$, with some constant $L' > 0$, where $c_0(x,a) := -r(x,a)$.

Then, we have the following result. 

Proposition 4.2. Under Assumption D, Example 4.1 satisfies Assumptions A, B and C. Therefore (by Theorem 3.11), there exists a constrained optimal policy for Example 4.1.

Proof. For each $m \ge 1$ and $x \in S$, let

(4.2)  $S_m := [-m,m]$,  $K_m := \{(x,a) \mid x \in S_m, a \in A(x)\}$,  $w'(x) := x^2+1$,  $w(x) := x^4+1$.

To verify Assumption A, it suffices to verify Assumption A(1) because Assumptions A(2) and A(3) follow from (4.2) and (4.1). Indeed, by (4.1) and a straightforward calculation, we have

(4.3)  $\int_S w(y)\,q(dy|x,a) \le \rho w(x) + b$  for some constant $b > 0$,

which implies Assumption A(1).

Obviously, Assumption B follows from (4.3) and Assumptions D(l) and 
D(2). 
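
The "straightforward calculation" behind the drift bound can be probed numerically. The sketch below is an illustration only: it uses the Gaussian fourth moment to write $\int_S w(y)\,q(dy|x,a) = (|x|+1)(6ax^2+3a^2)$ for $w(y) = y^4+1$, compares this with a direct trapezoidal integration against the $N(x,a)$ density, and evaluates the excess over $\rho w(x)$ on a grid. The function names, the grid, and the values of $\beta$ and $\rho$ in the usage note are assumptions made for the example.

```python
import math


def drift_closed_form(x, a):
    """Integral of w(y) q(dy|x,a) for w(y) = y^4 + 1 under the rates (4.1)."""
    # For Y ~ N(x, a): E[Y^4] = x^4 + 6*a*x^2 + 3*a^2, hence
    # E[w(Y)] - w(x) = 6*a*x^2 + 3*a^2.
    return (abs(x) + 1.0) * (6.0 * a * x * x + 3.0 * a * a)


def drift_numeric(x, a, n=20001, width=12.0):
    """The same quantity by trapezoidal integration against the N(x, a) density."""
    sd = math.sqrt(a)
    lo, hi = x - width * sd, x + width * sd
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        y = lo + i * step
        dens = math.exp(-(y - x) ** 2 / (2.0 * a)) / math.sqrt(2.0 * math.pi * a)
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * (y ** 4 + 1.0) * dens * step
    return (abs(x) + 1.0) * (total - (x ** 4 + 1.0))


def excess_over_grid(beta, rho, xs):
    """Largest value of drift - rho*w(x) on the grid xs.

    The drift is increasing in a, so the largest admissible action
    a = beta*(|x|+1) is the worst case at each x.
    """
    return max(drift_closed_form(x, beta * (abs(x) + 1.0)) - rho * (x ** 4 + 1.0)
               for x in xs)
```

For instance, with the hypothetical values $\beta = 0.1$ and $\rho = 0.7 > 6\beta$, `excess_over_grid(0.1, 0.7, [i / 10.0 for i in range(-500, 501)])` stays bounded, which is the role played by the constant $b$ in the drift condition; a grid check of this kind is of course only a sanity test and not a proof.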

To verify Assumption C, for any $g \in C_b(S)$, by (4.1) we have the following:

$$\int_S g(y)\,q(dy|x,a) = (|x|+1)\left[\int_{-\infty}^{\infty} g(y)\,\frac{1}{\sqrt{2\pi a}}\,e^{-(y-x)^2/(2a)}\,dy - g(x)\right],$$

which, together with the dominated convergence theorem, implies Assumption C(1). Therefore, Assumption C holds because Assumptions C(2) and C(3) follow from (4.1) and (4.2).

Using Example 4.1, we present computable examples for unconstrained 
optimal policies. 

Example 4.3. With the same data as in Example 4.1, we further suppose that $r(x,a)$ in Example 4.1 is given by

(4.4)  $r(x,a) := px^2 - \delta a^2$  for $(x,a) \in K$,

where $p, \delta > 0$ are fixed constants.

Assumption E. Let $\beta_0$ and $\beta$ be as in Example 4.1, and $L'$ as in Assumption D(2).

(1) d n > L'[a f s x 4 j(dx) + a + b]/[a(a - /?)] for all 1 < n < N), with b := 
/3(?f + 2) 2 ; 

(2) $2\alpha\beta_0 - \beta_0^2 < \frac{p}{\delta} < \min\{\alpha^2,\, 2\alpha\beta - \beta^2\}$, with $p, \delta$ as in (4.4).



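Proposition 4.4. Suppose that Assumptions D and E hold. Then:

(a) Example 4.3 satisfies Assumptions A, B and C. Moreover, $V_r(U) = \int_S u(x)\,\gamma(dx)$, where

$$u(x) = \left(2\delta\alpha - 2\sqrt{\delta^2\alpha^2 - p\delta}\right)x^2 + \left(4\delta\alpha - 4\sqrt{\delta^2\alpha^2 - p\delta} - \frac{2p}{\alpha}\right)|x| + 2\delta\alpha - 2\sqrt{\delta^2\alpha^2 - p\delta} - \frac{p}{\alpha}.$$

(b) The stationary policy $f^*$ is unconstrained optimal for Example 4.3, where

$$f^*(x) := \left[\alpha - \sqrt{\alpha^2 - \frac{p}{\delta}}\,\right](|x|+1) \qquad \forall x \in S.$$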

Proof. Note that Assumptions E(1) and D imply that $U = \Pi$ (by Theorem 3.3), and so the problem (2.12) becomes an unconstrained optimality problem. Thus, as in Proposition 4.2, under Assumptions D and E, we see that all assumptions in Theorem 3.3 in [12] are satisfied. Hence, Theorem 3.3 in [12] ensures the existence of a function $u$ in $B_w(S)$ such that, for each $x \in S$ and $\pi \in \Pi$,

(4.5)  $\alpha u(x) = \sup_{a\in A(x)}\left\{ r(x,a) + \int_S u(y)\,q(dy|x,a) \right\}$  and  $u(x) \ge V_r(x,\pi)$.

To obtain the analytic expression of $u$, we assume for a moment that

(4.6)  $u(x) := l_2x^2 + l_1x + l_0$  for $x \in S$, with some constants $l_1, l_2, l_0$.

Then, using (4.1), (4.4) and (4.5), by a straightforward calculation we have

(4.7)  $\alpha(l_2x^2 + l_1x + l_0) = \sup_{a\in A(x)}\left\{ px^2 - \delta\left[a - \frac{l_2(|x|+1)}{2\delta}\right]^2 + \frac{l_2^2(|x|+1)^2}{4\delta} \right\},$

which implies that $f^*(x) := \frac{l_2(|x|+1)}{2\delta}$ attains the maximum of the right-hand side of (4.7). Therefore, by Theorem 3.3 in [12], we have

(4.8)  $V_r(x,f^*) = u(x)$  and  $\alpha(l_2x^2 + l_1x + l_0) = px^2 + \frac{l_2^2(|x|+1)^2}{4\delta}$  $\forall x \in S$.

Comparing the coefficients of both sides in (4.8), we obtain

(4.9)  $\alpha l_2 = p + \frac{l_2^2}{4\delta}, \qquad \alpha l_1 = \begin{cases} \frac{l_2^2}{2\delta} & \text{if } x \ge 0,\\ -\frac{l_2^2}{2\delta} & \text{otherwise}, \end{cases} \qquad \alpha l_0 = \frac{l_2^2}{4\delta}.$

Under Assumption E, solving the system of equations (4.9) gives

$$l_2 = 2\delta\alpha - 2\sqrt{\delta^2\alpha^2 - p\delta}, \qquad l_0 = 2\delta\alpha - 2\sqrt{\delta^2\alpha^2 - p\delta} - \frac{p}{\alpha},$$

$$l_1 = \begin{cases} 4\delta\alpha - 4\sqrt{\delta^2\alpha^2 - p\delta} - \frac{2p}{\alpha} & \text{if } x \ge 0,\\ -\left(4\delta\alpha - 4\sqrt{\delta^2\alpha^2 - p\delta} - \frac{2p}{\alpha}\right) & \text{otherwise}, \end{cases}$$

which, together with (4.6) and (4.8), yields

$$u(x) = \left(2\delta\alpha - 2\sqrt{\delta^2\alpha^2 - p\delta}\right)x^2 + \left(4\delta\alpha - 4\sqrt{\delta^2\alpha^2 - p\delta} - \frac{2p}{\alpha}\right)|x| + 2\delta\alpha - 2\sqrt{\delta^2\alpha^2 - p\delta} - \frac{p}{\alpha},$$

$$f^*(x) = \left[\alpha - \sqrt{\alpha^2 - \frac{p}{\delta}}\,\right](|x|+1) \in A(x) \quad\text{and}\quad V_r(x,f^*) = u(x) \qquad \forall x \in S.$$

This, together with (4.5) and (2.10), completes the proof of this proposition. □
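
The algebra in (4.7) to (4.9) is easy to check numerically. The following sketch is only an illustration under assumed parameter values: it evaluates the closed-form coefficients, confirms that $f^*(x) = l_2(|x|+1)/(2\delta)$ maximizes the bracket $px^2 - \delta a^2 + l_2(|x|+1)a$ (the expression in (4.7) before completing the square), and that the maximal value equals $\alpha u(x)$. The function names and the numbers used in the usage note are hypothetical.

```python
import math


def coefficients(alpha, p, delta):
    """Solution (l2, |l1|, l0) of the system (4.9); requires p/delta < alpha**2."""
    root = math.sqrt(delta ** 2 * alpha ** 2 - p * delta)
    l2 = 2.0 * delta * alpha - 2.0 * root
    l0 = l2 - p / alpha
    l1_abs = 2.0 * l0
    return l2, l1_abs, l0


def u(x, alpha, p, delta):
    """Candidate value function u(x) = l2*x^2 + |l1|*|x| + l0."""
    l2, l1_abs, l0 = coefficients(alpha, p, delta)
    return l2 * x * x + l1_abs * abs(x) + l0


def f_star(x, alpha, p, delta):
    """Candidate optimal action f*(x) = (alpha - sqrt(alpha^2 - p/delta)) * (|x| + 1)."""
    return (alpha - math.sqrt(alpha ** 2 - p / delta)) * (abs(x) + 1.0)


def bracket(x, a, alpha, p, delta):
    """p*x^2 - delta*a^2 + l2*(|x|+1)*a, the quantity maximized over a in (4.7)."""
    l2, _, _ = coefficients(alpha, p, delta)
    return p * x * x - delta * a * a + l2 * (abs(x) + 1.0) * a
```

With, say, $\alpha = 2$, $p = \delta = 1$, $\beta_0 = 0.1$ and $\beta = 0.3$ (values chosen here so that $\alpha > 6\beta$ and Assumption E(2) hold), `bracket(x, f_star(x, 2.0, 1.0, 1.0), 2.0, 1.0, 1.0)` agrees with `2.0 * u(x, 2.0, 1.0, 1.0)` up to rounding, and `f_star` stays inside $[\beta_0, \beta(|x|+1)]$.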

Example 4.5. Let $S := (-\infty,\infty)$, $A(x) := [0, \beta(|x|+1)]$ for each $x \in S$ with some constant $\beta > 0$, and the reward $r(x,a)$ and transition rates $q(\cdot|x,a)$ are defined as follows: for each $(x,a) \in K$ and $D \in \mathcal{B}(S)$,

$$q(D|x,a) := (\beta|x|+a)\left[\int_{D-\{x\}} \frac{1}{\sqrt{2\pi(\beta(|x|+1)-a+1)}}\,e^{-(y-x)^2/(2(\beta(|x|+1)-a+1))}\,dy - \delta_x(D)\right],$$

$$r(x,a) := p|x|a - \delta a^2 \quad\text{for } (x,a) \in K, \text{ with } p, \delta > 0.$$

Assumption E. a > /3 2 ; f s x 2 ^/(dx) < 00; and /3 > max{l, ^}. 

Then, arguing as for Example 4.3 in Proposition 4.4, we have the
following results.

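Proposition 4.6. Under Assumption E, Example 4.5 satisfies Assumptions A, B and C. Moreover, if, in addition, $U = \Pi$, then $V_r(U) = \int_S u(x)\,\gamma(dx)$, where

$$u(x) = \frac{1}{2}\delta\left(\sqrt{\kappa+1}-1\right)x^2 + \frac{1}{2\alpha\kappa}\left[p\left(\sqrt{\kappa+1}-1\right) + \kappa\delta\beta\right](\beta+1)\left(\sqrt{\kappa+1}-1\right)|x| + \frac{1}{8\alpha\kappa}\delta(\beta+1)^2\left(\sqrt{\kappa+1}-1\right)^3$$

with $\kappa := \frac{p^2}{\delta^2(\alpha-\beta^2)} > 0$, and the following stationary policy $f^*$ is unconstrained optimal:

$$f^*(x) := \frac{p\left(\sqrt{\kappa+1}-1\right)}{\delta\kappa}|x| + \frac{1}{2\kappa}(\beta+1)\left(\sqrt{\kappa+1}-1\right)^2 \qquad \forall x \in S.$$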

Proof. The proof of Proposition 4.6 is similar to that of Proposition 
4.2, and thus the details are omitted here. □ 

Remark 4.7. In Examples 4.1, 4.3 and 4.5, the transition rates are 
unbounded, and the reward and costs are allowed to be unbounded from above 
and from below. In contrast, the transition rates in [18, 26, 27, 30, 37, 39, 40] 
are assumed to be bounded, and the costs in [11, 19, 20, 22, 27, 29] are 
assumed to be nonnegative. Moreover, Examples 4.3 and 4.5 seem to be the first
computable examples of unconstrained optimal policies for discounted
continuous-time MDPs in Polish spaces.
5. Proofs of the main results. In this section, we give proofs of Theorems
3.1, 3.5, 3.11, 3.13, 3.15, 3.16 and of Lemmas 3.8 and 3.9, which are stated
in Section 3.

To prove Theorem 3.1, we need the following two lemmas.

Lemma 5.1. Suppose that real-valued measurable functions $w \ge 0$ on $S$ and $q_t(D|x)$ on $[0,\infty) \times \mathcal{B}(S) \times S$ satisfy the following: for each $t \ge 0$, $D \in \mathcal{B}(S)$ and $x \in S$:

(1) $q_t(\cdot|x)$ is a signed measure on $\mathcal{B}(S)$ such that $q_t(S|x) = 0$, $q_t(D|x) \ge 0$ for all $x \notin D$ and $q_t(x) := q_t(S-\{x\}|x) < \infty$;

(2) $\int_S w(y)\,q_t(dy|x) \le \rho w(x) + b$, with constants $\rho \ne 0$ and $b \ge 0$.

Then the nonnegative function

(5.1)  $h(s,x,t) := e^{\rho(t-s)}w(x) + \frac{b}{\rho}\left(e^{\rho(t-s)} - 1\right)$

satisfies the following inequality:

$$\int_s^t\int_{S-\{x\}} e^{-\int_s^z q_v(x)\,dv}\,q_z(dy|x)\,h(z,y,t)\,dz + e^{-\int_s^t q_v(x)\,dv}\,w(x) \le h(s,x,t)$$

for all $x \in S$ and $0 \le s \le t < \infty$.

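Proof. Under conditions (1) and (2), a straightforward calculation gives

$$\begin{aligned}
&\int_s^t\int_{S-\{x\}} e^{-\int_s^z q_v(x)\,dv}\,q_z(dy|x)\,h(z,y,t)\,dz\\
&\quad\le \int_s^t e^{-\int_s^z q_v(x)\,dv}\left\{ e^{\rho(t-z)}\left[\rho w(x) + b + w(x)q_z(x) + \frac{b}{\rho}q_z(x)\right] - \frac{b}{\rho}q_z(x) \right\}dz\\
&\quad= h(s,x,t) - e^{-\int_s^t q_v(x)\,dv}\,w(x),
\end{aligned}$$

which verifies this lemma. □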
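
For the reader's convenience, the final equality in this calculation can be checked by differentiation. The following verification is a sketch added here, in which the auxiliary function $G$ is introduced only for this purpose and $v \mapsto q_v(x)$ is assumed to be locally integrable, so that the derivative exists for almost every $z$.

```latex
\begin{align*}
G(z) &:= e^{-\int_s^z q_v(x)\,dv}\,h(z,x,t), \qquad s \le z \le t,\\
\frac{\partial}{\partial z}h(z,x,t) &= -e^{\rho(t-z)}\bigl[\rho w(x) + b\bigr],\\
-\frac{d}{dz}G(z) &= e^{-\int_s^z q_v(x)\,dv}
  \Bigl\{ e^{\rho(t-z)}\Bigl[\rho w(x) + b + w(x)q_z(x) + \frac{b}{\rho}q_z(x)\Bigr]
  - \frac{b}{\rho}q_z(x) \Bigr\},\\
\int_s^t \Bigl(-\frac{d}{dz}G(z)\Bigr)dz &= G(s) - G(t)
  = h(s,x,t) - e^{-\int_s^t q_v(x)\,dv}\,w(x),
\end{align*}
```

where the last step uses $h(t,x,t) = w(x)$, which is immediate from (5.1).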

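Lemma 5.2. Suppose that Assumption A(1) holds for $\rho \ne 0$. Then, for any $\pi \in \Pi$ and $x \in S$,

$$E_x^\pi\left[w(\xi_t)I_{\{t<T_{k+1}\}}\right] \le e^{\rho t}w(x) + \frac{b}{\rho}\left(e^{\rho t} - 1\right) \qquad \forall k \ge 0 \text{ and } t \ge 0,$$

where $w$ and $b$ are from Assumption A(1).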
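Proof. Fix any $\pi \in \Pi$, $l \ge 1$, and $(x_0,\theta_1,x_1,\ldots,x_{l-1},\theta_l) \in (S \times \mathbb{R}_+^0)^l$. Let $m_l(\cdot|h_l,t)$ be as in (2.8). Then, it follows from Assumption A(1) that the following function on $[0,\infty) \times \mathcal{B}(S) \times S$:

$$q_t(D|x) := \begin{cases} m_l(D|x_0,\theta_1,x_1,\ldots,\theta_l,x,t), & \text{if } x \notin D,\\ -m_l(S|x_0,\theta_1,x_1,\ldots,\theta_l,x,t), & \text{if } D = \{x\}, \end{cases}$$

satisfies conditions (1) and (2) for Lemma 5.1.

Let $h(s,x,t) := e^{\rho(t-s)}w(x) + \frac{b}{\rho}(e^{\rho(t-s)} - 1)$ for all $x \in S$ and $t \ge s \ge 0$. Then, for each fixed $x \in S$ and $0 \le s \le t$, by Lemma 5.1 we have

(5.2)
$$\begin{aligned}
&\int_s^t\int_{S-\{x\}} m_l(dy|h_{l-1},\theta_l,x,z-T_l)\,h(z,y,t)\,e^{-\int_s^z m_l(S|h_{l-1},\theta_l,x,v-T_l)\,dv}\,dz + w(x)\,e^{-\int_s^t m_l(S|h_{l-1},\theta_l,x,v-T_l)\,dv}\\
&\quad= \int_{s-T_l}^{t-T_l}\int_{S-\{x\}} m_l(dy|h_{l-1},\theta_l,x,u)\,h(u,y,t-T_l)\,e^{-\int_{s-T_l}^{u} m_l(S|h_{l-1},\theta_l,x,v)\,dv}\,du + w(x)\,e^{-\int_{s-T_l}^{t-T_l} m_l(S|h_{l-1},\theta_l,x,v)\,dv}\\
&\quad\le h(s-T_l,x,t-T_l) = h(0,x,t-s).
\end{aligned}$$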
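Moreover, by (2.5) and (2.9), we have

$$E_x^\pi\left[w(\xi_t)I_{\{t<T_{k+1}\}}\,\middle|\,\mathcal{F}_{T_k}\right] = e^{-\int_0^{t-T_k} m_k(S|h_k,v)\,dv}\,w(x_k)\,I_{\{T_k\le t\}} + I_{\{T_k>t\}}\sum_{m=1}^{k} I_{\{T_{m-1}\le t<T_m\}}\,w(x_{m-1}).$$

Now, using (5.2) at $l = k$, $s = T_k = T_l$, $x = x_k = x_l$, gives

$$E_x^\pi\left[w(\xi_t)I_{\{t<T_{k+1}\}}\,\middle|\,\mathcal{F}_{T_k}\right] \le I_{\{T_k\le t\}}\,h(T_k,x_k,t) + I_{\{T_k>t\}}\sum_{m=1}^{k} I_{\{T_{m-1}\le t<T_m\}}\,w(x_{m-1}),$$

which implies that the following (5.3) holds for $n = 0$:

(5.3)
$$E_x^\pi\left[w(\xi_t)I_{\{t<T_{k+1}\}}\,\middle|\,\mathcal{F}_{T_{k-n}}\right] \le I_{\{T_{k-n}\le t\}}\,h(T_{k-n},x_{k-n},t) + I_{\{T_{k-n}>t\}}\sum_{m=1}^{k-n} I_{\{T_{m-1}\le t<T_m\}}\,w(x_{m-1}) \qquad \forall k \ge n \ge 0.$$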
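Suppose that (5.3) holds for some $0 \le n < k$. Then, by (2.9) we have

$$\begin{aligned}
&E_x^\pi\left[w(\xi_t)I_{\{t<T_{k+1}\}}\,\middle|\,\mathcal{F}_{T_{k-n-1}}\right]\\
&\quad\le E_x^\pi\left[ I_{\{T_{k-n}\le t\}}\,h(T_{k-n},x_{k-n},t) + I_{\{T_{k-n}>t\}}\sum_{m=1}^{k-n} I_{\{T_{m-1}\le t<T_m\}}\,w(x_{m-1}) \,\middle|\, \mathcal{F}_{T_{k-n-1}}\right]\\
&\quad= E_x^\pi\left[ I_{\{T_{k-n}\le t\}}\,h(T_{k-n},x_{k-n},t) + I_{\{T_{k-n}>t\}}I_{\{T_{k-n-1}\le t<T_{k-n}\}}\,w(x_{k-n-1}) \,\middle|\, \mathcal{F}_{T_{k-n-1}}\right]\\
&\qquad+ I_{\{T_{k-n-1}>t\}}\sum_{m=1}^{k-n-1} I_{\{T_{m-1}\le t<T_m\}}\,w(x_{m-1})\\
&\quad= I_{\{T_{k-n-1}\le t\}}\Biggl[\int_0^{t-T_{k-n-1}}\int_{S-\{x_{k-n-1}\}} m_{k-n-1}(dy|h_{k-n-1},s)\,h(T_{k-n-1}+s,y,t)\,e^{-\int_0^s m_{k-n-1}(S|h_{k-n-1},v)\,dv}\,ds\\
&\qquad\qquad+ e^{-\int_0^{t-T_{k-n-1}} m_{k-n-1}(S|h_{k-n-1},v)\,dv}\,w(x_{k-n-1})\Biggr] + I_{\{T_{k-n-1}>t\}}\sum_{m=1}^{k-n-1} I_{\{T_{m-1}\le t<T_m\}}\,w(x_{m-1}),
\end{aligned}$$

which together with $h(T_{k-n-1}+s,y,t) = h(s,y,t-T_{k-n-1})$ and (5.2) again, gives

$$E_x^\pi\left[w(\xi_t)I_{\{t<T_{k+1}\}}\,\middle|\,\mathcal{F}_{T_{k-n-1}}\right] \le I_{\{T_{k-n-1}\le t\}}\,h(T_{k-n-1},x_{k-n-1},t) + I_{\{T_{k-n-1}>t\}}\sum_{m=1}^{k-n-1} I_{\{T_{m-1}\le t<T_m\}}\,w(x_{m-1}).$$

Hence, (5.3) holds for all $0 \le n \le k$, and so this lemma follows from (5.3) at $n = k$. □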

Proof of Theorem 3.1. (a) We first prove the following fact:

(5.4)  $P_x^\pi\left(\xi_t I_{\{T_k\le t<T_{k+1}\}} \notin S_l \text{ for some } k \ge 0\right) \to 0$  as $l \to \infty$.

To prove (5.4), let $\Gamma_l := \{e : \xi_t(e)I_{\{T_k\le t<T_{k+1}\}}(e) \notin S_l \text{ for some } k \ge 0\}$ for any $l \ge 1$.

Suppose that, for some $\varepsilon > 0$ and any $L \ge 1$, there exists $l \ge L$ such that

(5.5)  $P_x^\pi(\Gamma_l) = P_x^\pi\left(\{e : \xi_t(e)I_{\{T_k\le t<T_{k+1}\}}(e) \notin S_l \text{ for some } k \ge 0\}\right) > \varepsilon.$

Then, by Assumption A(2), we can take the corresponding $l$ such that (5.5) holds and also the following inequality

(5.6)  $w(y) > \frac{1}{\varepsilon}\left[e^{\tilde\rho t}w(x) + \frac{b}{\tilde\rho}\left(e^{\tilde\rho t} - 1\right)\right] \qquad \forall y \notin S_l$

is satisfied, where $\tilde\rho := |\rho| + 1$.

For the taken $l \ge 1$ in (5.6), let us define new transition rates $\tilde q(D|x,a)$ as follows:

$$\tilde q(D|x,a) := \begin{cases} q(D|x,a), & \text{if } x \in S_l,\\ 0, & \text{if } x \notin S_l. \end{cases}$$

The quantities such as probabilities corresponding to $\tilde q(D|x,a)$ are equipped with the tilde.

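We next prove that

(5.7)  $P_x^\pi\left(\xi_t I_{\{T_k\le t<T_{k+1}\}} \in S_l \text{ for all } k \ge 0\right) = \tilde P_x^\pi\left(\xi_t I_{\{T_k\le t<T_{k+1}\}} \in S_l \text{ for all } k \ge 0\right).$

Indeed, it is obvious that

$$P_x^\pi(X_0 \in S_l) = \tilde P_x^\pi(X_0 \in S_l) = I_{S_l}(x).$$

Let $X_k^t := X_k I_{\{T_k\le t<T_{k+1}\}}$. Then, by (2.5) we have $\{\xi_t I_{\{T_k\le t<T_{k+1}\}} \in S_l\} = \{X_k^t \in S_l\}$. We now suppose that for some $n \ge 0$,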

i?({4eS,,o<fc< n }nr) 

(5-8) 

= p;({4e5i,o<Kn}nr) vrG£(tf n ), 

where and P£ are regarded as the marginal on H n+ i. 

Using the notation in (2.8) and (2.9), for any D G 6(5), < ti<t 2 < oo, 
we have 

P£({Xl eSi,0<k<n, and X l n+l G S{\ H {T x (t u t 2 ) x D}) 

x m n (5i n D\h n , t )e- & m ^ s \ h ^) dv di 
= It It ^^ dhn ^ I ^ xt k eSlfi ^ k ^ I ^ x n+i eS i nD } 

X m n (Sir\D\h n ,t)e- Io^n(S\h n ,v)dv ^ 

= P£({X t k e Si,0<k<n, andX* +1 G n {r x (t l9 t 2 ) xD}), 



(5.11 
which together with the arbitrariness of $D \in \mathcal{B}(S)$ and $0 \le t_1 < t_2$ implies (5.8) for $n+1$, and thus (5.7) follows from the induction.

Thus, from (5.5) and (5.7), we have

(5.9)  $\tilde P_x^\pi(\Gamma_l) = \tilde P_x^\pi\left(\xi_t I_{\{T_k\le t<T_{k+1}\}} \notin S_l \text{ for some } k \ge 0\right) > \varepsilon.$

Moreover, since $\|\tilde q\| := \sup_{x\in S, a\in A(x)} |\tilde q(\{x\}|x,a)| = \sup_{x\in S_l, a\in A(x)} |q(\{x\}|x,a)| < \infty$, we now show by induction that

(5.10)  $\tilde E_x^\pi\left[e^{-T_k}\right] \le \left[1 - e^{-\|\tilde q\|}(1 - e^{-1})\right]^k \qquad \forall k \ge 1.$

In fact, by (2.8) we have \fhk(S\hk) < \\q\\ for all k > 1, and it follows from 
(2.9) that 

/•l roc 

E£[e- Tl ]= / m (S\x)e-™ o{s \ x)t e- t dt+ / m Q {S\x)e-™ o{s \ x)t e" 1 dt 
^ Jo Ji 

< 1-e-ll^ll /" e-*di = [l-e-ll«H(l-e- 1 )]. 
■/ o 

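Suppose that (5.10) holds for some $k \ge 1$. Then, as in the arguments for (5.11), from (2.8) and (2.9) we also have $\tilde E_x^\pi[e^{-T_{k+1}}] \le \tilde E_x^\pi\left[e^{-T_k}\left[1 - e^{-\|\tilde q\|}(1 - e^{-1})\right]\right] \le \left[1 - e^{-\|\tilde q\|}(1 - e^{-1})\right]^{k+1}$, and so (5.10) follows. Hence, by (5.10) and the Chebyshev inequality we have

$$\tilde P_x^\pi(T_\infty \le t) \le \tilde P_x^\pi(T_k \le t) = \tilde P_x^\pi\left(e^{-T_k} \ge e^{-t}\right) \le e^t\,\tilde E_x^\pi\left[e^{-T_k}\right] \le e^t\left[1 - e^{-\|\tilde q\|}(1 - e^{-1})\right]^k$$

for all $k \ge 1$, and so $\tilde P_x^\pi(T_\infty > t) = 1$. Since $t > 0$ can be arbitrary, we have $\tilde P_x^\pi(T_\infty = \infty) = 1$, and therefore, $\sum_{k=0}^\infty \tilde P_x^\pi(T_k \le t < T_{k+1}) = 1$. Since Assumption A(1) still holds when $\rho$ and $q(D|x,a)$ are replaced with $\tilde\rho$ and $\tilde q(D|x,a)$, respectively, by Lemma 5.2 we have

(5.12)  $\tilde E_x^\pi[w(\xi_t)] = \lim_{k\to\infty} \tilde E_x^\pi\left[w(\xi_t)I_{\{t<T_{k+1}\}}\right] \le e^{\tilde\rho t}w(x) + \frac{b}{\tilde\rho}\left(e^{\tilde\rho t} - 1\right).$

On the other hand, using (5.6) and (5.9), we see

$$\tilde E_x^\pi[w(\xi_t)] = \tilde E_x^\pi[w(\xi_t)|\Gamma_l]\,\tilde P_x^\pi(\Gamma_l) + \tilde E_x^\pi[w(\xi_t)|\Gamma_l^c]\,\tilde P_x^\pi(\Gamma_l^c) > e^{\tilde\rho t}w(x) + \frac{b}{\tilde\rho}\left(e^{\tilde\rho t} - 1\right),$$

which contradicts (5.12), and thus (5.4) is proved.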

Since $\Gamma_{l+1} \subset \Gamma_l$ for all $l \ge 1$, by (5.4) we conclude that $P_x^\pi\left(\bigcap_{l\ge 1}\Gamma_l\right) = 0$, and so

(5.13)  $P_x^\pi\left(\{\text{for each } l \ge 1, \text{ there exists } k \text{ such that } \xi_t I_{\{T_k\le t<T_{k+1}\}} \notin S_l\}\right) = 0.$

Since $\{\inf\{s : \xi_s \notin S_l\} \le t\} \subset \{\xi_t I_{\{T_k\le t<T_{k+1}\}} \notin S_l \text{ for some } k \ge 1\}$, by (5.13) we conclude $P_x^\pi(\inf\{s : \xi_s \notin S_l\} \le t,\ l = 1, 2, \ldots) = 0$, and thus $P_x^\pi(\inf\{s : \xi_s \notin S_l\} > t \text{ for some } l \ge 1) = 1$, or, equivalently, $P_x^\pi(\xi_s \in S_l \text{ for all } s \in [0,t], \text{ for some } l \ge 1) = 1$. For any $k \ge 1$, let $B_k := \{\xi_s \in S_l \text{ for all } s \in [0,k], \text{ for some } l \ge 1\}$. Then, $B_{k+1} \subset B_k$ and $P_x^\pi(B_k) = 1$ for all $k \ge 1$, and thus $P_x^\pi\left(\bigcap_{k=1}^\infty B_k\right) = 1$, which together with (2.5) implies $P_x^\pi(T_\infty = \infty) = 1$. To further prove $P_x^\pi(\xi_t \in S) = 1$, using the facts $\sum_{k\ge 0} P_x^\pi(T_k \le t < T_{k+1}) = P_x^\pi(T_\infty = \infty) = 1$ and $P_x^\pi(\xi_t \in S \mid T_k \le t < T_{k+1}) = 1$ for all $k \ge 1$, we have that $P_x^\pi(\xi_t \in S) = \sum_{k\ge 0} P_x^\pi(\xi_t \in S \mid T_k \le t < T_{k+1})\,P_x^\pi(T_k \le t < T_{k+1}) = 1$, and thus (a) follows.

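(b) First, consider the case of $\rho \ne 0$. Since $\sum_{k=0}^\infty P_x^\pi(T_k \le t < T_{k+1}) = 1$ for all $t \ge 0$,

$$E_x^\pi[w(\xi_t)] = E_x^\pi\left[w(\xi_t)\sum_{k=0}^\infty I_{\{T_k\le t<T_{k+1}\}}\right] = \lim_{k\to\infty} E_x^\pi\left[w(\xi_t)I_{\{t<T_{k+1}\}}\right],$$

which together with Lemma 5.2 implies the first part of (b). Moreover, the results for the case of $\rho = 0$ can be obtained by letting $\rho \downarrow 0$.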

(c) Define an integer- valued random measure p* on i3(]R+) x B(S) 



(5.14) 



p*(dt,dx) --^I^^S^x^id^dx), 



k>l 



which counts the exits from dx. Then, as Lemma 4.28 in [28], the random 
measure 



v*(e, dt, dx) :- 



n(da\e, t)q(dx\£ t - (e) , a)I dx (e) ) 



dt 



is a dual predictable projection of the measure p* with respect to V and 
P^ (for any fixed policy tt G II and initial distribution 7). Hence, by (4.5) in 
[28] we have 



E%\fi*((0,t},D)} = E*[v*((0,t],D)] 



< El 



n{da\e, s) sup q*(x) ds 



< 00 



Vt>0, 



which together with \p*((0,t],D) - fi*((0,t],D)\ < 1 and (4.5) in [28] again, 
implies 

E%\ M *((p,t],D)]=EZ[v*((0,t],D)]<oo. 

Thus, using the obvious representation I^ teD y = Id(x) + p* ((0,t], D) — p*((0, 
t],D), by taking the expectation E% of the representation we see that (c) is 
true. □ 
Proof of Theorem 3.5. (a) For the given $D$, by Theorem 3.1(c) and (3.1) we have

$$\begin{aligned}
\hat\eta^\pi(D) &= \gamma(D) + \alpha\int_0^\infty e^{-\alpha t}\,E_\gamma^\pi\left[\int_0^t\int_A \pi(da|e,s)\,q(D|\xi_{s-}(e),a)\,ds\right]dt\\
&= \gamma(D) + \alpha\int_S\int_A q(D|x,a)\int_0^\infty\int_0^t e^{-\alpha t}\,E_\gamma^\pi\left[\pi(da|e,s)\,I_{\{\xi_{s-}(e)\}}(dx)\right]ds\,dt\\
&= \gamma(D) + \frac{1}{\alpha}\int_S\int_A q(D|x,a)\,\eta^\pi(dx,da),
\end{aligned}$$

and so (a) follows.

(b) Recall that rj(dx,da) = 7](dx)(j) ri (da\x). Then, to prove (b), it suffices 
to show 



(5.15) 



S J A 



u(x,a)rj(dx,da) = / / u(x , a)rj^ (dx , da) 



SJA 



aV a (x, <p v , u) = I u(x,a)(j) ri (da\x) 
A(x) 



+ / V a {y,<F,v)q{dy\x,<p) Vx G 5. 



for each nonnegative bounded measurable function u on K. In fact, for any 
such a function u, by Lemma 5.3 in [12] and (2.10) we have 



(5.16) 



On the other hand, let ||iz||i := sup^ ^g^- \u(x, a)\ < oo, and \q(dx\x, (/> r? )| the 
total variation of q(dy\x,(j) n ). Then, by (T2)-(T3) and the condition in (b) 
we have 

\V a (y,4>' n ,u)\\q(dy\x,()) ll )\fj(dx) < "LJLL f \q({ x }\x, (jf>)\f)(dx) <oo, 

a Js 

which together with the Jordan decomposition of q(-\x,(f) v ) and Theorem 
2.6.4 in [3], implies 



s Js 



[v(dy)q(dx\y,(j) v )]V a (x,(j)' n ,u) = 
Hence, by Assumption A(3) we have 

[f,{dy)q{dx\y,4P)]V a {x,4>\u) 



V a (y,<P\u)q(dy\x,^) 



fj(dx). 



lim 

k— >oo 



(5.17) 



s k Js 



lim 

k— >oo 



V a {y,4>\u)q{dy\x,^) 



fj(dx). 
Thus, for any fixed $k \ge 1$, since $\sup_{x\in S_k} q^*(x) < \infty$, by (5.16) and (a) we
have



u(x, a)r](dx, da) 

u(x, a)[fj(dx)(j) ri (da\x)] 



S k JA 



S k JA(x) 

aV a (x,(p v ,u) - / V a (y,4> v ,u)q(dy\x,(f> v ) 
Js 



a V a {x,<P,uyy(dx)+ / V a (y,<P,u) 



Sk 



Sk 



Sk 



fj(dx) 
fj(dx)q(dy\x, (f) 11 ) 



V a (y,^,u)q(dy\x,^) 
u(x, a)^ (dx, da) + 



S k JA 



Sk 



Sk 



V a (y,^,u)q(dy\x,^) 



fj(dx) 

V(dy)q(dx\y,cf) ri ) 
fj(dx), 



which together with (5.17) gives (5.15). 
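(c) Since $\phi \in \Pi_s$, by (a) and (3.2) we have

$$\alpha\hat\eta^\phi(D) = \alpha\gamma(D) + \int_S q(D|x,\phi)\,\hat\eta^\phi(dx) = \alpha\gamma(D) + \int_S\int_A q(D|x,a)\left[\hat\eta^\phi(dx)\,\phi(da|x)\right] \qquad \forall D \in \mathcal{B}(S) \text{ with } \sup_{x\in D} q^*(x) < \infty.$$

Moreover, under Assumptions A, B(2) and B(3), by Theorem 3.3 we have

(5.18)  $\int_S |q(\{x\}|x,\phi)|\,\hat\eta^\phi(dx) \le L\left[\alpha\int_S w(x)\,\gamma(dx) + b\right]\Big/[\alpha(\alpha-\rho)] < \infty.$

Thus, by (b) we see that $\hat\eta^\phi(dx)\,\phi(da|x) = \eta^\phi(dx,da)$, and so (c) follows. □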

Proof of Lemma 3.8. (a) Since the first part of (a) follows from (3.4), 
we need to verify the second part of (a). In fact, for each \i €V(S x A), 
by (3.5) we have J s w(x)f^(p J )(dx) = j 1/(t5( 1 x))A(dx) < 00, and so the second 

part of (a) follows. 

(b) By (3.4) and (3.5) and a straightforward calculation, we see that (b) 
is true. 
(c) and (d). We prove (c) and (d) together. Suppose that % — > r/o- Take
any bounded continuous function u on S x A. Then, since w is continuous, 
by f]k — > r]Q we have 

lim / v(x,a)w(x)r)k(dx,da) 



k— >oo 



I SxA 

which together with (3.4), imply 



SxA 

v(x,a)w(x)rio(dx,da) for u :=«,!, 



(5.19) lim / u(x,a)Tyj(rik)(dx,da) = / u(x,a)T iD (r]o)(dx,da), 
k ^°°JsxA J SxA 

and thus, T^rjk) T^rjo). 

On the other hand, suppose that fik — > fJ-o, and pick up any continuous 
function u(x,a) on S x A such that \u(x, a)\ < L u w{x) for all (x,a) G K, 
with some nonnegative constant L u depending on u. Then, the functions 
u [ x ' a J and 4 are bounded continuous on S x i. Hence, a straightforward 
calculation gives 

(5.20) lim / u{x,a)T' iS (ijL} : ){dx,da) = I u(x,a)T^j(no)(dx,da). 
k ^°°JsxA J SxA 

By (5.19) and (5.20) and (b), we see that (c) and (d) are both true. □ 

Proof of Lemma 3.9. (a) For any rf 1 ,^ 2 £ M and < /3 < 1, let 
rj := firf 1 + (1 — f3)r] n2 . Then, by Theorem 3.5(a) and a straightforward cal- 
culation we have 

arj(D) = aj(D) + / q(D\x,a)r](dx,da) 
JsxA 

(5.21) 

VL> e B(5) with sup < oo, 

and also f s w(x)fj(dx) = f s w(x)[/3f]' Kl (dx) + (1 — /3)?f 2 (dx)] < oo. Thus, by 
Theorem 3.5(b) and (5.21), there exists a randomized stationary policy <f> ri € 
IT S such that r] = rj^ . Hence, M. is convex, and thus so is A4^. 

(b) Take any sequence {r/ m } in M. such that 7] m t]q (and thus rj m —> 
rjo). Then, under Assumptions A, B(2) and B(3), by Theorem 3.1(b) we have 

f \" (a \ f f \ (a a \ s afs w ( x h( dx )+ b 
w{x)rj m (dx)= / w{x)rj m {dx,da) < — - — r 

(5.22) - h ai "- p) 

= M{ < oo Vm > 1. 
Thus, by Lemma 11.4.7 in [22] we have

l({x}\x,4> rio )\f)o(dx) < L / w(x)rjo(dx) < Lliminf / w{x)fj m {dx) 



< LM{ < oo. 

Thus, to prove 770 £ A^o, by Theorem 3.5(b) it suffices to show 
ar]o(D) = 07(D) + / q(D\x,a)r]o(dx,da) 



K 



VD G B{S) with sup q*(x) < 00, 



which can follow (by Proposition 7.18 in [4]) from 

/ g(y)Vo(dy) = a g(y)<y(dy) + / / g(y)q(dy\x,a)r] (dx,da) 
Js Js JSJK 



a 

(5.23) 



V^eCUS). 



Thus, the rest verifies (5.23). For any g G Cb(S), by 7] m £ M. Q and Theorem 
3.5(a) we have 



«/ 9(y)Vm(dy) = a g{y)l{dy)+ I I g(y)q(dy\x,a)rj m (dx,da) 
JS k JS k JS k JK 

(5.24) 

Vfe, m > 1. 

Since q*(x) < Lw(x) for all x £ S, using Assumption A(3) and the dominated 
convergence theorem, by (5.22) and (5.24) with letting k — > 00 we have 



a / g{y)f}m{dy) = a I g(y)j(dy) + / / g(y)q(dy\x,a)r] m {dx,da) 
(5.25) JS ^ JsJK 

Vm > 1. 

On the other hand, since | f s g(y)q(dy\x,a)\ < 2\\g\\iq*(x) < 2L\\g\\iw(x) [for 
all a € v4(x)], by r] m rjo and Assumption C(l), we have 

lim / g(y)fi m (dy)= lim / g{y)r] m (dy , da) = / g{y)r) (dy,da) 

= / g(y)vo(dy) 
Js 

and 



lim 

m— >oa 



g(y)q(dy\x,a)i] m (dx,da) 



g(y)q(dy\x,a)T]o(dx,da), 

S JK 



'S JK 

which together with ( 5.25) give (5.23), and so (b) follows. □ 
Proof of Theorem 3.11. (a) Since V(S x A) is metrizable, it follows
from Lemma 3.8 (with w := w) that V W (S x A) is also metrizable, and 
so are M Q and M.%. Since M. Q is closed (by Lemma 3.9) and M.% is a 
closed subset of M. Q under the additional Assumption C(l), it suffices to 
show that M. is sequentially relatively compact. Indeed, for each r\ G Ai -, 
since 1 < f s w'(x)f)(dx) < oo [using Assumption C(2)], T w i(rj) is well defined. 
Moreover, by (3.4) and Theorem 3.3, we have 

w{x) T ,s, dx da) = IsxA w ^ x )^ dx ^ da ) 

Sxaw'(x) w ' f SxA v/(x)rj(dx,da) 

< / w(x)rj(dx, da) < aM* \/rj^M , 

J SxA 

where M| is as in Theorem 3.3(b). Thus, by Assumption C(2) and Prohorov' 
theorem (see Theorem 12.2.15 in [22]) we see that {T w i(rj),r] G A4 Q } is se- 
quentially relatively compact, and so is M. a (by Lemma 3.8 with w := w'). 

(b) Under Assumptions A and B, by Theorem 3.3(b) we have [T^ 71 ")! < 
MM{ and |K(tt)| < MM* for 1 < n < N. Moreover, by Theorem 3.5 and 
(2.12) [equivalently, (3.3)] we can find a sequence {rf k } (tt^ G LT s , k = 1, . . .) 
such that 

V r (U) = lim — I r(x,a)r] Wk (dx,da), 

(5.26) 

/ c n (x,a)rf k {dx,da) <ad n , n = l,...,N. 
Jk 

Then, by (a) there exists a subsequence {?] Wk ™} and r/o G M Q such that 
rf k ™ % as m— > oo, which together with (5.26) implies 

V r (U) = — I r(x,a)rio(dx,da) 

and 

/ c n (x,a)r]o(dx,da) < ad n , n = l,...,N, 
Jk 

and so (f/ 10 is constrained optimal. □ 

Proof of Theorem 3.13. Parts (a) and (b) are direct consequences
of (3.9) and Theorem 3.5. Moreover, (c) follows from (b) and
Theorem 3.11(b). □

Proof of Theorem 3.15. (a) Under Assumptions A, B(2), B(3) and C(3), by Theorems 3.1 and 3.5 we have

$$\mathcal{M}_o = \left\{\eta^\pi \,\middle|\, \int_S w(x)\,\hat\eta^\pi(dx) \le \alpha M^*,\ \pi \in \Pi\right\} = \left\{\eta^\pi \,\middle|\, \int_S w(x)\,\hat\eta^\pi(dx) \le \alpha M^*,\ \pi \in \Pi_s\right\}.$$

We now prove that $\eta^f$ is an extreme point in $\mathcal{M}_o$ for each $f \in F$. In fact, for any fixed $f \in F$, suppose that $\eta^f$ is not an extreme point in $\mathcal{M}_o$. Then, there exist $\beta \in (0,1)$ and $\pi_1, \pi_2 \in \Pi_s$ such that

(5.27)  $\eta^f = \beta\eta^{\pi_1} + (1-\beta)\eta^{\pi_2}$  and  $\eta^{\pi_1} \ne \eta^{\pi_2}$,

which implies that $\hat\eta^{\pi_k} \ll \hat\eta^f$ ($k = 1, 2$). Thus, it follows from (5.27) and
Theorem 3.5 that
drj ni dff 2 

f(da\x) = (3 f (x)ni(da\x) + (1 — 0) — t (x)iT2(da\x) and 

dri-i driJ 

(5.28) 

dfi wi dfi n2 

?w {x)+{1 - p) w {x)=l yx£S 

for some 5 G B(S) with fjf(S) = 1, where denote the (nonnegative) 
Radon-Nikodym derivative. Moreover, by rf 1 ^ rf 2 we see that rjf ({x G 
S\k\ (r|x) ^ 7T2(r|x) for some T G B(A)}) > 0. (Otherwise, rf 1 and ry 71 " 2 coin- 
cide.) Thus, for each x £ {x £ S\iti(T\x) ^ 7r2(r|x) for some T G B(A)}, there 
exists a corresponding F x G S(j4) (depending on x) such that < tti(T x \x) < 
K2(X x \x) < 1. Therefore, by (5.28) we have that < ^(r^jx) < f(T x \x) < 
7T 2(F x \x) < 1, which contracts with the nonrandom of / G F. 

(b) By (a) we only need to show the necessity part. Suppose that ir G II S 
and rf ^ t}* for all / G F. Then, there exists D G B(S) such that < fj n (D) < 
1 and < ir(T x \x) < 1 for all x G D and some T x G B(A(x)) (depending on 
x). Then, by the condition in (b), there exists x' G D such that 

0<fj n ({x'}) <l and 

(5.29) 

< 7r(rV|x') < 1 for some IV G B(A(x')). 

By (5.29), we now define two policies tt\ and 1x2 as follows: 

fK l a I ^ _ ( ir(da\x), ifx^x', 

[b.6U) ^y aa \ x r--\^ da ^T xl \x')/^Y xl \x') 1 iix = x'- 

, . , .. 1 % / 7r(<ia|x), ifx/x', 

(5.31) 7T 2 (da|z) := j ^ p ^ /^tf), if x = ^. 

Let P := vr^lx'), 5' := ^ 2{ J^-^U (.{*>}) when ^(MH^CM) 
0, and 5' = \ when ^({a/}) + ff ri ({x'}) = 0. Then, for each D G 6(5) with 
sup xg £, q*(x) < 00, by Theorem 3.5 and (5.30), (5.31) as well as a straight- 
forward calculation we have 



arj ni 



(D) = a-y{D) + [ q{D\x,Tr)f,' K1 {dx) 
JS-{x'} 
+ / q{D\x' \a)ir{da\x')ff'{{x'})/ '/3,



r, 



afi 7T2 (D)=aj(D)+ [ q(D\x , vr)?f 2 (dx) 
Js~{x'} 



+ / q{D\x',a)7r(da\x')fj^({x'})/{l-/3). 

x' 

Multiplying by 5' and (1 — 5') the two equalities, respectively, and then 
summarizing, we have 

a{5'fj ni (D) + {l-8')fi 7T2 (D)} 

= aj(D) + [ q(D\x^)[5'fj 1T1 (dx) + {l-5')fj 1T2 {dx)}, 
Js 

which together with Theorem 3.5(c) implies rf = 8'rf 1 + (1 — 8')rf 2 . More- 
over, by (5.29) we see that < rf 1 ({x'} x iy) = ff 1 {{x 1 }) < 1 and rf 2 ({x'} x 
Y x i) = ff 2 ({x'})it 2 {T x >\x') = 0. Hence, rf = 5'rf 1 + (1 - 5')rf 2 is not an ex- 
treme point. □ 

Proof of Theorem 3.16. Let $\phi^*$ be a constrained optimal policy [by Theorem 3.13(c)], and $\mathcal{M}_o^c(e)$ be the set of all extreme points in $\mathcal{M}_o^c$ in (3.7). Since $\mathcal{M}_o^c$ has been proved to be convex and compact [by Theorem 3.11(a) and Lemma 3.9], by Choquet's theorem [32], $\eta^{\phi^*}$ is the barycenter of a probability measure $\mu$ supported on $\mathcal{M}_o^c(e)$. Therefore,

(5.32)  $\int_{S\times A} c_0(x,a)\,\eta^{\phi^*}(dx,da) = \int_{\mathcal{M}_o^c(e)}\left(\int_{S\times A} c_0(x,a)\,\eta(dx,da)\right)\mu(d\eta).$

On the other hand, since $\int_{S\times A} c_0(x,a)\,\eta^{\phi^*}(dx,da) \le \int_{S\times A} c_0(x,a)\,\eta(dx,da)$ for all $\eta \in \mathcal{M}_o^c(e)$, it follows from (5.32) that there exists $\eta^* \in \mathcal{M}_o^c(e)$ such that

$$\int_{S\times A} c_0(x,a)\,\eta^*(dx,da) = \int_{S\times A} c_0(x,a)\,\eta^{\phi^*}(dx,da).$$

Hence, $\pi^* := \phi^{\eta^*}$ is also constrained optimal. Moreover, since $\int_{S\times A} c_n(x,a)\,\eta(dx,da)$ (for each fixed $1 \le n \le N$) is linear in $\eta \in \mathcal{M}_o$ and thus can be regarded as a "hyperplane," each extreme point of $\mathcal{M}_o^c$ is a convex combination of at most $N+1$ extreme points in $\mathcal{M}_o$. That is, there exist $(N+1)$ numbers $p_k \ge 0$ and stationary policies $f_k \in F$ ($k = 1, \ldots, N+1$) (using Theorem 3.15) such that $\eta^* = p_1\eta^{f_1} + \cdots + p_{N+1}\eta^{f_{N+1}}$, $p_1 + \cdots + p_{N+1} = 1$, which together with Theorem 3.15 and (3.2) completes the proof of this theorem. □